Category Archives: testing

Why Common Core Standards Will Succeed

Even though there is little evidence that state standards have increased student academic achievement since the 1980s, the District of Columbia and 45 states have embraced the Common Core (see here and here).

Even though countries with national standards do not necessarily score higher on international tests than nations without them, many states have already aligned their textbooks, lessons, and tests to the new standards (see here and here).

Even though there is little evidence that Common Core standards will produce the skilled and knowledgeable graduates that employers and college teachers have demanded of public schools, most state and federal officials have assured parents and taxpayers that the new standards and tests will do exactly that (see here and here).

Even though there is little evidence that state and national officials have resolved tough issues with curriculum standards in the past (e.g., supplying professional development for teachers and principals, providing appropriate instructional materials, determining whether teachers altered their practices), much less reduced the inevitable problems that will occur in implementing the Common Core standards (e.g., resources for computer-based testing), cheerleaders continue to beat the drums for national standards (see here and here).


With all of these “even though”s  (and there are more), Common Core standards will succeed. How can that be?

The short answer is that evidence of success doesn't matter much to those who make policy decisions. Oh sure, decision-makers have to mention evidence and research studies, and they do, but not much when it comes to Common Core standards. Instead, what they talk about are failing schools, the low quality of teaching, and how, unless academic standards are raised (drum roll here at mention of Common Core), the economy will sink under the weight of graduates unprepared for an information-based workplace. Getting everyone to go to college, especially minority and poor students, is somehow seen as a solution to the economic, political, and social inequalities that have persistently plagued the U.S. for the past four decades.

Reform-minded policy elites–top federal and state officials, business leaders, and their entourages with unlimited access to media (e.g., television, websites, print journalism)–use these talking points to engage the emotions and, of course, spotlight public schools as the reasons why the U.S. is not as globally competitive as it should be. By focusing on the Common Core, charter schools, and evaluating teachers on the basis of student test scores, these decision-makers have shifted public attention away from fiscal and tax policies and economic structures that not only deepen and sustain poverty in society but also reinforce privilege of the top two percent of wealthy Americans. Policy elites have banged away unrelentingly at public schools as the source of national woes for decades.

National, state, and local opinion-makers in the business of school reform know that what matters is not evidence, not research studies, not past experiences with similar reforms; what matters is the appearance of success. Success is 45 states adopting standards, national tests taken by millions of students, and public acceptance of Common Core. Projecting positive images (e.g., the film Waiting for Superman, "everyone goes to college") and pushing myths (e.g., U.S. schools are broken, schools are an arm of the economy): that is what counts in the theater of school reform.

Within a few years (say, by 2016, a presidential election year), policy elites will declare the new standards a "success" and, hold onto your hats, introduce more and better standards and tests.

This happened before with minimum competency tests in the 1970s. By 1980, thirty-seven states had mandated these tests for grade-to-grade promotion and high school graduation. A Nation at Risk (1983) judged these tests too easy since most students passed them. So goodbye to competency tests. The same thing happened in the 1990s with the launching of upgraded state curriculum standards (e.g., Massachusetts), followed by NCLB and, later, the Common Core. It is happening now and will happen again.

Policy elites see school reform as a form of theater. Blaming schools for serious national problems, saying the right emotionally-loaded words, and giving the appearance of doing mighty things to solve the “school” problem matter far more than hard evidence or past experiences with similar reforms.

41 Comments

Filed under school reform policies, testing

Buying iPads, Common Core Standards, and Computer-Based Testing

The tsunami of computer-based testing for public school students is on the horizon. Get ready.

For adults, computer-based testing has been around for decades. For example, I have taken and re-taken California's online test to renew my driver's license twice in the past decade. To get certified as a volunteer driver for Packard Children's Hospital in Palo Alto, I had to read gobs of material about hospital policies and federal regulations on confidentiality before taking a series of computer-based tests. To obtain approval from Stanford University for a research project of which I am the principal investigator, one in which I would interview teachers and observe classrooms, I had to read online a massive amount of material on university regulations about subjects' consent to participate, confidentiality, and handling of information gathered from interviews and classroom observations. And again, I took online tests that I had to pass in order to gain the University's approval to conduct the research. Beyond the California Department of Motor Vehicles, Children's Hospital, and Stanford University, online assessment has been a staple in the business sector, from hiring through employee evaluations. So online testing is already part of adult experience.

What about K-12 students? Increasingly, districts are adopting computer-based testing. For example, Measures of Academic Progress, a popular test used in many districts, is online. Speeding up this adoption are the Common Core standards and the two consortia that are preparing assessments for the 45 states on the cusp of implementing the standards. Many states have already mandated online testing for their own standardized tests to prepare for the impending national assessments. These tests will require students to have access to a computer with the right hardware, software, and bandwidth to accommodate online testing by 2014-2015 (see here, here, and here).

There are many pros and cons to online testing as compared with, say, paper-and-pencil tests. But whatever the merits of that debate, it has been outslugged and outstripped by the surge of buying new devices and piloting computer-based tests to get ready for Common Core assessments (see here and here). Los Angeles Unified School District, the second largest in the nation, just signed a $50 million contract with Apple for iPads. One of the key reasons to buy these devices for the initial rollout to 47 schools was Common Core standards and assessment. Each iPad comes with an array of pre-loaded software compatible with the state online testing system and the impending national assessments. The entire effort is called The Common Core Technology Project.

The best (and most recent) gift to the hardware and software industry has been the Common Core standards and assessments. At a time of fiscal retrenchment in school districts across the country when schools are being closed and teachers are let go, many districts have found the funds to go on shopping sprees to get ready for the Common Core.

And here is the point that I want to make. The old reasons for buying technology have been shunted aside for a sparkling new one. Consider that for the past three decades the rationale for buying desktop computers, laptops, and now tablets has been three-fold:

1. Make schools more efficient and productive so that students learn more, faster, and better than they had before.

2. Transform teaching and learning into an engaging and active process connected to real life.

3. Prepare the current generation of young people for the future workplace.

After three decades of rhetoric and research, teachers, principals, students, and vendors have their favorite tales to prove that these goals have been achieved. But for those who want more than Gee Whiz stories, who seek a reliable body of evidence showing that students learn more, faster, and better, that teaching and learning have been transformed, that using these devices has prepared the current generation for actual jobs, well, that body of evidence is missing for each of these traditional reasons to buy computers.

With Common Core standards adopted, the rationale for getting devices has shifted. No longer does it  matter whether there is sufficient evidence to make huge expenditures on new technologies. Now, what matters are the practical problems of being technologically ready for the new standards and tests in 2014-2015: getting more hardware, software, additional bandwidth, technical assistance, professional development for teachers, and time in the school day to let students practice taking tests.

Whether the Common Core standards will improve student achievement (however measured), whether students learn more, faster, and better: none of this matters in deciding on which vendor to use. The question is no longer whether to buy, but how much do we have to spend and when can we get the devices? That is the tidal wave on the horizon.

17 Comments

Filed under technology, testing

Cheating Scandals Reaffirm, Not Diminish, Testing

Not until the trials (or plea bargains) are over, will a verdict be rendered on former Superintendent Beverly Hall’s guilt or innocence in what is called the Atlanta cheating scandal. Hall’s indictment follows on the heels of finding El Paso Superintendent Lorenzo Garcia guilty last Fall. He is now serving three and a half years in jail (see here and here).

Even before a judge or jury decides on her guilt or innocence, anti-testing groups, feeding on Atlanta, El Paso, and the investigation of tampering with test scores under Washington, D.C. school chief, Michelle Rhee, have grabbed the case to further their cause. Moreover, over the years, journalists have uncovered oddities in test scores jumping sky-high in one year in other districts across the nation.

Foes of standardized tests feel the rush of adrenaline in saying that these examples of dishonest adults raising student test scores to receive applause and cash awards are pervasive. Defenders of standardized testing and accountability, however, see the cheating as exceptions, as a few rotten apples in a barrel full of worm-free ones. Most educators, advocates of test-driven accountability say, are decent, hard-working professionals who play by the rules and can be trusted to do the right thing.

In this volleying back-and-forth between advocates and foes of standardized testing,  school scandals have been compared to cheating in baseball, bicycle racing, and other sports.

From Mark McGwire's stained home run record to Tour de France winner Lance Armstrong's admission that he doped while racing, these and other sports have come under a dark cloud of suspicion, an outcome damaging to top athletes, companies dependent upon income derived from professional sports, fans turning into cynics, and disappointed youth who only want to play the game by the rules.

Cheating in both sports and schools can be traced to the unleashed and fierce competition in performing better and better to gain ever-larger rewards. Professional sports are money machines and being a top performer is rewarded handsomely; scores on international tests, ranking schools within a state and district based on performance, a broader array of school choices, and federal regulations in No Child Left Behind and Race to the Top  have ratcheted upward intense pressure to beat  state tests.

Also common to school cheating and drug-drenched sports is betraying the public trust to gain personal advantage.  When adults erase student answers and professional athletes take illegal drugs to enhance performance, such acts erode the faith that adults and youth have in social institutions being fair.

Another common feature is the unshaken confidence that current authorities have in written and computerized tests assessing student learning and drug tests determining whether athletes are cheating. When cheating is uncovered, few decision-makers question the tests. Tighter security and better tests are the solutions.

*Few decision-makers question whether there might be something wrong in professional athletics (i.e., expansion of baseball, football, hockey, and basketball leagues and over-the-top competition for more money).

*Few decision-makers question whether most toddlers and young children from low-income families should be tested especially since they bring to school very different strengths and weaknesses than children from middle and upper-income homes. Or that such early testing of young children squeezes inequities into judgments of what they can and cannot do in preschool and elementary school classrooms.

*Few decision-makers question the national obsession with student test scores as the correct metric to judge schools, teachers, and students.

This deep reluctance to question powerful interests invested in socioeconomic structures and cultures in which cheating occurs is why I believe that standardized tests in schools, like drug testing in sports, will be reaffirmed rather than overturned. There will be continuing challenges–as there should be–but standardized testing will remain rock-solid. Why?

First, note that most of the cheating incidents have occurred largely in districts where high percentages of poor and minority students attend school. Sure, there are exceptions, but when you look closely at where dishonesty is found, those charter and regular public schools enroll large numbers of children from low-income families. I have yet to find any district school boards, investigators, charter school leaders, or policymakers recommending an examination of the tests to see whether they do what they are supposed to do, or, after conducting such an examination, finding the tests unworthy and getting rid of them. Yes, there have been protests by educators, students, and middle- and upper-middle-class families against too much standardized testing (see here and here). These protests have led to occasional boycotts, but none have occurred, to my knowledge, in poor neighborhoods. If anything, there is a reaffirmation of tests, calls for greater security, and plaudits for any whistle-blowers.

The point is that these tests sort students and schools by scores that reinforce rather than erase existing gaps in achievement. And sorting is necessary to determine who, beginning at the age of four, shall climb each rung of that ladder reaching college. The system of private and public schooling requires such tests to distinguish high achievers from others. If the tests were really accurate in distinguishing which children and youth are smart on paper, with people, and in life, now and later, then perhaps we would need such tests. But that is not the case now… not by a long shot.

Second, to underscore the above point, consider the experience of cheating on the SAT. After a scandal revealed that high-scoring individuals with fake IDs were paid to take the SAT test, Educational Testing Service tightened security at test sites. No challenges of the test itself occurred. SAT scores remain crucial for college admission and no school boards, teachers, or parent groups called for the end of the test.

Count on cheaters getting more clever and investigators still hunting them down. Amid increasing numbers of cheating incidents, standardized tests will be challenged, maybe the numbers even reduced, but nonetheless, they will reign for the immediate future.

16 Comments

Filed under leadership, school reform policies, testing

From Quill Pens to Computer Adaptive Testing: Old and New Technological Devices

There are many definitions of instructional technology. One concentrates on the devices teachers use in classrooms. Another focuses on the different ways that teachers have used such devices as tools to advance learning in lessons. Still other definitions frame technology as processes: ways of organizing classrooms, schools, and districts.

I examine the second definition in this post: the connection between the writing tools students used and the perpetual demand of teachers in every culture, over the past two millennia, to find out what students have learned. Here I consider the quill and steel-tipped pen, the pencil, the ball-point pen, and, yes, the computer.

I begin with the quill pen.


Here is how Robert Travers (1983, pp. 97-98) described quill pens.

The quill pen was first mentioned in the writings of Saint Isidore of Seville in the seventh century…. The quill seems to have been by far the best writing instrument invented in its time for it displaces all other forms. It became the main instrument used in schools, apart from the slate…. Even in the late 1800s, the quill pen was still the most widely used instrument for writing. Quills [came] from the wings of geese, but swan quills were also highly valued. For fine work, the quills of crows were sometimes used.

In 1809, an inventor, Joseph Bramah, developed a machine for cutting quills into lengths, and the short lengths were then inserted into a wooden holder….The separation of the point and the holder led to many inventions, and one of these was the [metal-tipped pen]….The first factory for the mass production of the steel pen was established in New Jersey in 1870….

When the steel pen entered education, a revolution in school practice [occurred]. Writing with the quill had been a slow, unhurried art…. [T]he writer had to stop frequently in order to reshape and sharpen the quill. Since writing was a slow art, pride was taken in it….The steel pen changed that. The steel pen made it possible to write continuously over long periods. There was ever increasing pressure on the pupil to produce written material in quantity. The new medium for written work then became used for examinations, which became substitutes for the form of oral examination provided by the recitation [where students would be quizzed in public for their knowledge]….

By 1890, students had become so used to the steel pen that examinations were commonly administered using this writing instrument as a tool to produce rapidly written answers.


As an elementary school student in the Pittsburgh (PA) public schools in 1940, at the now-demolished Minersville elementary school, I sat at a desk with a hole for the then-defunct inkwell.

What about pencils? Like metal-tipped pens, mass-produced pencils did not appear in most classrooms until the early decades of the 20th century. And with pencils, teachers assessed what students learned through hand-written homework, quizzes, essays, and multiple-choice tests (introduced in the U.S. during World War I). These cheap devices (mass-produced ball-point pens arrived in schools in the 1940s) made assessing students' knowledge inexpensive (after all, no one pays students to take tests or for lost learning time) and efficient in judging promotion, retention, graduation, and other high-stakes outcomes.

Now arrives computer adaptive testing (CAT). Used a great deal in the private sector for employment and other purposes, computerized testing has entered schools over the past few decades. In Measures of Academic Progress (MAP), for example, students sit in front of computer screens and take tests that are tailored to their ability. When a student answers an item (usually multiple-choice) correctly, the student is given a harder item. If the student gives a wrong answer, the screen shows an easier question. This goes on until the computer bank runs out of items or the computer has sufficient information to give the student a score. Whichever happens first, the test is over.
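The loop described above can be sketched in a few lines of code. This is a simplified illustration, not how MAP or any vendor actually implements it: real CAT systems select items and estimate ability with item response theory, while here difficulty simply steps up or down one level, and the item bank, difficulty scale, and scoring rule are all invented for the example.

```python
# A minimal sketch of the adaptive loop described above. Illustration only:
# real CAT systems use item response theory, not a one-step difficulty ladder.

def adaptive_test(item_bank, answer_fn, start_difficulty=3, max_items=10):
    """item_bank: dict mapping difficulty level (1-5) -> list of items.
    answer_fn(item) -> True if the student answers the item correctly."""
    difficulty = start_difficulty
    score = 0
    administered = 0
    while administered < max_items:
        pool = item_bank.get(difficulty, [])
        if not pool:                  # the bank ran out of items at this level
            break
        item = pool.pop()             # administer one item, remove it from the bank
        administered += 1
        if answer_fn(item):
            score += difficulty                    # credit weighted by difficulty
            difficulty = min(difficulty + 1, 5)    # correct -> harder item next
        else:
            difficulty = max(difficulty - 1, 1)    # wrong -> easier item next
    return score

# A student who answers every item correctly climbs the ladder until
# the hardest items run out.
bank = {d: [f"q{d}-{i}" for i in range(4)] for d in range(1, 6)}
print(adaptive_test(bank, lambda item: True))   # prints 27
```

Note how both stopping conditions from the description appear: the test ends either when the bank has no more items at the needed level or when enough items have been administered to assign a score.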

Highly touted by promoters and vendors–see McGraw-Hill YouTube segment for an example of hype–CAT is part of the package that new national tests accompanying Common Core standards will include by 2014. There are, as with any new technological device, clear advantages and disadvantages of this form of assessment (see Computer Adaptive Testing).

Like quill and steel-tipped pens dipped in ink, pencils, and ball-point pens, here is another technological device that is being bent toward finding out what students know. Ideally, of course, there would be no need for CAT or the mountain-high summative tests currently in vogue across the country were the nation's teachers sufficiently trusted to use the many ways they assess daily what their students know and can do, and were districts to build and increase teacher knowledge and skills in assessment. That kind of investment in teacher knowledge and skills, and the accompanying trust in teachers and schools to assess and report the results, are, sad to say, missing in action.

So watch computer adaptive testing become the new steel-tipped pen of the late-19th century.

10 Comments

Filed under how teachers teach, technology use, testing

Chicago Teachers’ Strike, Performance Evaluation, and School Reform (Jack Schneider and Ethan Hutt)

For the fifth consecutive day in Chicago, nearly 30,000 teachers are out on strike. At issue are many of the contractual details that typically complicate collective bargaining—pay, benefits, and the length of the work day. But the heart of the dispute roiling the Windy City is something relatively new in contract talks: a complicated statistical algorithm.

District leaders in Chicago, following the lead of reformers in cities nationwide, are pushing for a “value-added” evaluation system. Unlike traditional forms of evaluation, which rely primarily on classroom observations, policymakers in Chicago propose to quantify teacher quality through the analysis of student achievement data. Using cutting-edge statistical methodologies to analyze standardized test scores, the district would determine the value “added” by each teacher and use that information as a basis for making personnel decisions.

Teachers are opposed to this approach for a number of reasons. But educational researchers are generally opposed to it, too, and their reasoning is far less varied: value-added evaluation is unreliable.

As researchers have shown, value-added methodologies are still very much works-in-progress. Scholars like Heather Hill have found that teachers' value-added scores correlate not only with the quality of their instruction, but also with the population of students they teach. Researchers examining schools in Palm Beach, Florida, discovered that more than 40 percent of teachers scoring in the bottom decile one year, according to value-added measurements, somehow scored in the top two deciles the following year. And according to a recent Mathematica study, the error rate for comparing teacher performance was 35 percent. Such figures could only inspire confidence among those working to suspend disbelief.
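The decile-jumping instability is easy to reproduce in a toy simulation. Every number below is invented for illustration and nothing is drawn from the Palm Beach study: each teacher gets a fixed "true" quality, each year's value-added score is that quality plus fresh measurement noise, and we count how many bottom-decile teachers land in the top two deciles the next year purely by chance.

```python
import random

# Toy simulation: noisy year-to-year "value-added" scores reshuffle
# decile rankings even when true teacher quality never changes.

def jump_rate(n_teachers=1000, noise=1.0, seed=42):
    """Fraction of bottom-decile teachers (year 1) who land in the
    top two deciles in year 2, purely from re-rolled noise."""
    rng = random.Random(seed)
    true_quality = [rng.gauss(0, 1) for _ in range(n_teachers)]
    year1 = [q + rng.gauss(0, noise) for q in true_quality]
    year2 = [q + rng.gauss(0, noise) for q in true_quality]

    def decile(scores, i):            # 0 = bottom decile, 9 = top decile
        rank = sorted(scores).index(scores[i])
        return rank * 10 // len(scores)

    bottom = [i for i in range(n_teachers) if decile(year1, i) == 0]
    jumped = [i for i in bottom if decile(year2, i) >= 8]
    return len(jumped) / len(bottom)

print(jump_rate(noise=0.0))   # 0.0: with no noise, rankings are perfectly stable
print(jump_rate(noise=2.0))   # noisy measures send some bottom-decile teachers near the top
```

The design choice worth noticing is that nothing about the teachers changes between years; only the noise is re-drawn, yet rankings churn.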

And yet suspending disbelief is exactly what reformers are doing. Instead of slowing down the push for value-added, they’re plowing full steam ahead. Why?

The promise of a mechanized quality-control process, it turns out, has long captivated education reformers. And while the statistical algorithm in question right now in Chicago happens to be quite new, reformer obsession with ostensibly standardized, objective, and efficient means of gauging value is, in fact, quite old. Unfortunately, as the past reveals, plunging headlong into a cutting-edge measurement technology is also quite problematic.

Example 1:
Nearly a century ago, school leaders saw a breakthrough in measurement technology as a way of measuring teacher quality. By using newly-designed IQ tests to assess “native ability,” school administrators could translate student scores on standardized tests into measures of teacher effectiveness. Of course, not everyone was on board with this effort. As one school superintendent noted, some educators were concerned “that the means for making quantitative and qualitative measures of the school product” were “too limited to provide an adequate basis for judgment.” But the promise of the future was too tempting and, as he argued, though it was “impossible” to measure teacher quality rigorously, “a good beginning” had been made. Reformers plowed ahead.

The IQ movement was deeply flawed. The instruments were faulty and culturally-biased. The methodology was inconsistent and poorly applied. And the interpretations were horrifying. “If both parents are feeble-minded all the children will be feeble-minded,” wrote H.H. Goddard in 1914. “Such matings,” he reasoned, “should not be allowed.” Others drew equally shocking conclusions. E.G. Boring, a distinguished psychologist of the period, wrote in 1920 that “the average man of many nations is a moron.” The average Italian-American adult, he calculated, had a mental age of 11.01 years. African-Americans were at the bottom of his list, with an average mental age of 10.41.

Value-added proponents like to make the argument that "some data is better than no data." Yet in the case of the mental testing movement, that was patently false. For the hundreds of thousands of students tracked into dead-end curricula, to say nothing of the forced sterilization campaigns that took place outside of schools, reform was imprudent and irresponsible.

But one need not go back so far into the educational past for examples of half-baked quality-control reforms peddled by zealous policymakers.

Example 2:
In the 1970s, 37 states hurriedly adopted “minimum competency testing” legislation and implemented “exit examinations,” ignoring the concerns of experts in the field. As one panel of scholars observed, the plan was “basically unworkable” and exceeded “the present measurement arts of the teaching profession.” Reformers, however, were not easily dissuaded.

The result of the minimum competency movement was the development of a high stakes accountability regime and years of litigation. Reformers claimed that the information revealed by such tests would provide the sunlight and shame that schools needed to improve. Yet while they awaited that outcome, thousands of students suffered the indignity of being labeled “functionally illiterate,” were forced into remedial classes, and had their diplomas withheld despite having enough units to graduate—all on the basis of a test that leading scholars described as “an indefensible technology.”

Contrary to what reformers claimed, the information provided by such deeply flawed tests did little to improve students’ learning or the quality of schools.

Today’s policymakers, like those of the past, want to adopt new tools as swiftly as possible. Even flawed value-added measures, they argue, are better than nothing. Yet the risks of early adoption, as the past reveals, can far outweigh the rewards. Simply put, acting rashly on incomplete information makes mistakes more costly than necessary.

Today's value-added boosters believe themselves to be somehow different: acting, they say, on better incomplete information. Yet the idea that incomplete information can be good information strains credulity.

Good technologies do tend to improve over time. And if advocates of value-added models are confident that they can work out the kinks, they should continue to experiment with them judiciously. In the meantime, however, such models should be kept out of all high-stakes personnel decisions. Until we can make them work well enough, they shouldn't count.

________________________________

Jack Schneider is an Assistant Professor of Education at the College of the Holy Cross and author of Excellence For All: How a New Breed of Reformers Is Transforming America’s Public Schools. Ethan Hutt is a doctoral candidate at the Stanford University School of Education and has been named “one of the world’s best emerging social entrepreneurs” by the Echoing Green Foundation.

14 Comments

Filed under how teachers teach, school reform policies, testing

Testing, Testing, and Testing: More Cartoons

The U.S. has tests galore. Driving, alcohol, steroids, DNA, citizenship, blood, pregnancy, and on and on. Most serve a specific purpose and carry personal consequences if one passes or fails. School tests, however (to pass a course, to be promoted to another grade, to graduate, and to judge whether a school is satisfactory or on probation), have proliferated dramatically in the past three decades. Americans are split in their opinions about these tests.

Surveys report that most teachers (but by no means all) believe that there is too much standardized testing. Some parents have mobilized to boycott annual tests. Most respondents to opinion polls, however, support curriculum standards, accountability, and, yes, state tests.

Of the many cartoons on testing that I have located, most reflect the opinion that there is too much testing and too much is made of the results. I have found very few–none that I can recall or that I have posted–endorsing standardized tests. Here is a sampling of those cartoons.

For those readers who wish to see previous monthly posts of cartoons, see: "Digital Kids in School," "Testing," "Blaming Is So American," "Accountability in Action," "Charter Schools," "Age-graded Schools," "Students and Teachers," "Parent-Teacher Conferences," "Digital Teachers," and "Addiction to Electronic Devices."

11 Comments

Filed under testing

Three Important Distinctions In How We Talk About Test Scores (Matt DiCarlo)

"Matthew Di Carlo is a senior fellow at the non-profit Albert Shanker Institute in Washington, D.C. His current research focuses mostly on education policy, but he is also interested in social stratification, work and occupations, and political attitudes/behavior." The post appeared May 25, 2012.

In education discussions and articles, people (myself included) often say “achievement” when referring to test scores, or “student learning” when talking about changes in those scores. These words reflect implicit judgments to some degree (e.g., that the test scores actually measure learning or achievement). Every once in a while, it’s useful to remind ourselves that scores from even the best student assessments are imperfect measures of learning. But this is so widely understood – certainly in the education policy world, and I would say among the public as well – that the euphemisms are generally tolerated.

And then there are a few common terms or phrases that, in my personal opinion, are not so harmless. I’d like to quickly discuss three of them (all of which I’ve talked about before). All three appear many times every day in newspapers, blogs, and regular discussions. To criticize their use may seem like semantic nitpicking to some people, but I would argue that these distinctions are substantively important and may not be so widely-acknowledged, especially among people who aren’t heavily engaged in education policy (e.g., average newspaper readers).

So, here they are, in no particular order.

In virtually all public testing data, trends in performance are not “gains” or “progress.” When you tell the public that a school or district’s students made “gains” or “progress,” you’re clearly implying that there was improvement. But you can’t measure improvement unless you have at least two data points for the same students – i.e., test scores in one year are compared with those in previous years. If you’re tracking the average height of your tomato plants, and the shortest one dies overnight, you wouldn’t say that there had been “progress” or “gains,” just because the average height of your plants suddenly increased.
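The tomato-plant point is just arithmetic, and a few lines make it concrete (the plant heights below are made up for the illustration):

```python
# The tomato-plant arithmetic, made concrete: the cross-sectional average
# rises even though no individual plant improved.

heights_year1 = [10, 24, 26, 28]   # four plants, one very short
heights_year2 = [24, 26, 28]       # the shortest plant died overnight

avg1 = sum(heights_year1) / len(heights_year1)   # 22.0
avg2 = sum(heights_year2) / len(heights_year2)   # 26.0

print(avg2 > avg1)   # True: the average "gained" 4 units with zero growth
```

The same mechanics apply when a cohort of students leaves a school's tested sample and another enters: the average can move without any student's performance changing.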

Similarly, almost all testing trend data that are available to the public don’t actually follow the same set of students over time (i.e., they are cross-sectional). In some cases, such as NAEP, you’re comparing a sample of fourth and eighth graders in one year with a different cohort of fourth and eighth graders two years earlier. In other cases, such as the results of state tests across an entire school, there’s more overlap – many students remain in the sample between years – but there’s also a lot of churn. In addition to student mobility within and across districts, which is often high and certainly non-random, students at the highest tested grade leave the schools (unless they’re held back), while whole new cohorts of students enter the samples at the lowest tested grade (in middle schools serving grades seven and eight, this means that half the sample turns over every year).

So, whether it’s NAEP or state tests, you’re comparing two different groups of students over time. Often, those differences cannot be captured by standard education variables (e.g., lunch program eligibility), but are large enough to affect the results, especially in smaller schools (smaller samples are more prone to sampling error). Calling the differences between years “gains/progress” or “losses” therefore gives a false impression; at least in part, the differences reflect nothing more than variation between the cohorts being compared.

Proficiency rates are not “scores.” Proficiency or other cutpoint-based rates (e.g., percent advanced) are one huge step removed from test scores. They indicate how many students scored above a certain line. The choice of this line can be somewhat arbitrary, reflecting value judgments and, often, political considerations as to the definition of “proficient” or “advanced.” Without question, the rates are an accessible way to summarize the actual scale scores, which aren’t very meaningful to most people. But they are interpretations of scores, and severely limited ones at that.*

Rates can vary widely, using the exact same set of scores, depending on where the bar is set. In addition, all these rates tell you is whether students were above or below the designated line – not how far above it or below it they might be. Thus, the actual test scores of two groups of students might be very different even though they have the same proficiency ranking, and scores and rates can move in opposite directions between years.
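A short sketch with made-up scale scores shows both problems at once — the same scores can yield very different rates depending on the cutpoint, and rates and averages can even move in opposite directions:

```python
# Hypothetical scale scores for one group of students.
scores = [180, 195, 205, 210, 240]

def proficiency_rate(scores, cutpoint):
    """Share of students scoring at or above the cutpoint."""
    return sum(s >= cutpoint for s in scores) / len(scores)

# The exact same scores, but very different "results" depending on the bar.
print(proficiency_rate(scores, 200))  # 0.6
print(proficiency_rate(scores, 220))  # 0.2

# Rates and averages can move in opposite directions between years.
year1 = [199, 199, 199, 260, 260]  # mean 223.4, rate 0.4 at cutpoint 200
year2 = [200, 200, 200, 200, 200]  # mean 200.0, rate 1.0 at cutpoint 200
```

Between the two hypothetical years, the proficiency rate at a cutpoint of 200 more than doubles while the average score actually falls — exactly the kind of divergence that reporting rates alone conceals.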

To mitigate the risk of misinterpretation, comparisons of proficiency rates (whether between schools/districts or over time) should be accompanied by comparisons of average scale scores whenever possible. At the very least, the two should not be conflated.**

Schools with high average test scores are not necessarily “high-performing,” while schools with lower scores are not necessarily “low-performing.” As we all know, tests don’t measure the performance of schools. They measure (however imperfectly) the performance of students. One can of course use student performance to assess that of schools, but not with simple average scores.

Roughly speaking, you might define a high-performing school as one that provides high-quality instruction. Raw average test scores by themselves can’t tell you about that, since the scores also reflect starting points over which schools have no control, and you can’t separate the progress (school effect) from the starting points. For example, even the most effective school, providing the best instruction and generating large gains, might still have relatively low scores due to nothing more than the fact that the students it serves have low scores upon entry, and they only attend the school for a few years at most. Conversely, schools with very high scores might provide poor instruction, simply maintaining (or even decreasing) the already stellar performance levels of the students they serve.

We very clearly recognize this reality in how we evaluate teachers. We would never judge teachers’ performance based on how highly their students score at the end of the year, because some teachers’ students were higher-scoring than others’ at the beginning of the year.

Instead, to the degree that school (and teacher) effectiveness can be assessed using testing data, doing so requires growth measures, as these gauge (albeit imprecisely) whether students are making progress, independent of where they started out and other confounding factors. There’s a big difference between a high-performing school and a school that serves high-performing students; it’s important not to confuse them.
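A simple gain-score comparison (with invented numbers — real growth models are far more sophisticated, adjusting for measurement error and other confounds) illustrates why levels and growth can point in opposite directions:

```python
# Hypothetical fall/spring scores for the same students in two schools.
school_a_fall   = [150, 160, 170]  # low-scoring on entry...
school_a_spring = [175, 185, 195]  # ...but large gains: +25 each

school_b_fall   = [240, 250, 260]  # high-scoring on entry...
school_b_spring = [242, 252, 262]  # ...but small gains: +2 each

def mean(xs):
    return sum(xs) / len(xs)

# Judged by raw spring averages, School B looks far "better"...
print(mean(school_a_spring), mean(school_b_spring))  # 185.0 252.0

# ...but judged by growth, School A is doing much more with its students.
growth_a = mean(school_a_spring) - mean(school_a_fall)  # 25.0
growth_b = mean(school_b_spring) - mean(school_b_fall)  # 2.0
```

On levels alone, School B wins by almost 70 points; on growth, School A is generating more than ten times the progress. Conflating the two is precisely the confusion between a high-performing school and a school that serves high-performing students.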

_______________________________________

* Although this doesn’t affect the point about the distinction between scores and rates, it’s fair to argue that scale scores also reflect value judgments and interpretations, as the process by which they are calculated is laden with assumptions – e.g., about the comparability of content on different tests.

** Average scores, of course, also have their strengths and weaknesses. Like all summary statistics, they hide a lot of the variation. And, unlike rates, they don’t provide much indication as to whether the score is “high” or “low” by some absolute standard (thus making them very difficult to interpret), and they are usually not comparable between grades. But they are a better measure of the performance of the “typical student,” and as such are critical for a more complete portrayal of testing results, especially viewed over time.
