Category Archives: testing

Principals And Test Scores

I recently read a blog post by two researchers who assert that principals can improve students’ test scores. The researchers cite studies that support their claim (see below). These researchers received a large grant from the Wallace Foundation to alter their principal preparation program to turn out principals who can, indeed, raise students’ academic achievement.

I was intrigued by this post because, as a district superintendent, I believed the same thing and urged the 35 elementary and secondary principals I supervised—we met face-to-face twice a year to go over their annual goals and outcomes, and I spent a morning or afternoon at each school at least once a year—to be instructional leaders and thereby raise test scores. Over the course of seven years, however, I saw how complex the process of leading a school is, how much principals’ performance varies, and how many roles principals play in their schools to engineer gains on state tests (see here and here). And I began to see clearly what a principal can and cannot do. Those memories came back to me as I read this post.

First the key parts of the post:

A commonly cited statistic in education leadership circles is that 25 percent of a school’s impact on student achievement can be explained by the principal, which is encouraging for those of us who work in principal preparation, and intuitive to the many educators who’ve experienced the power of an effective leader. It lacks nuance, however, and has gotten us thinking about the state of education-leadership research—what do we know with confidence, what do we have good intuitions (but insufficient evidence) about, and what are we completely in the dark on? ….

Quantifying a school leader’s impact is analytically challenging. How should principal effects be separated from teacher effects, for instance? Some teachers are high-performing, regardless of who leads their school, but effective principals hire the right people into the right grade levels and offer them the right supports to propel them to success.

Another issue relates to timing: Is the impact of great principals observed right away, or does it take several years for principals to grapple with the legacy they’ve inherited—the teaching faculty, the school facilities, the curriculum and textbooks, historical budget priorities, and so on? Furthermore, what’s the right comparison group to determine a principal’s unique impact? It seems crucial to account for differences in school and neighborhood environments—such as by comparing different principals who led the same school at different time points—but if there hasn’t been principal turnover in a long time, and there aren’t similar schools against which to make a comparison, this approach hits a wall.

Grissom, Kalogrides, and Loeb carefully document the trade-offs inherent in the many approaches to calculating a principal’s impact, concluding that the window of potential effect sizes ranges from .03 to .18 standard deviations. That work mirrors the conclusions of Branch, Hanushek, and Rivkin, who estimate that principal impacts range from .05 to .21 standard deviations (in other words, four to 16 percentile points in student achievement).

Our best estimates of principal impacts, therefore, are either really small or really large, depending on the model chosen. The takeaway? Yes, principals matter—but we still have a long way to go before we can confidently quantify just how much.
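For readers who want to check the arithmetic, an effect size expressed in standard deviations can be converted to percentile points with the normal distribution. A minimal sketch in Python (my own illustration, not from either study cited above; it assumes a student starting at the 50th percentile, and the figures shift with the baseline chosen):

```python
from statistics import NormalDist

def percentile_gain(effect_size_sd, start_percentile=50.0):
    """Percentile points gained when a score rises by `effect_size_sd`
    standard deviations on a normally distributed test."""
    z = NormalDist().inv_cdf(start_percentile / 100)  # baseline z-score
    return NormalDist().cdf(z + effect_size_sd) * 100 - start_percentile

# The Branch, Hanushek, and Rivkin range of .05 to .21 standard deviations:
for d in (0.05, 0.21):
    print(f"{d:.2f} SD from the median = {percentile_gain(d):.1f} percentile points")
```

From the median, .05 SD works out to about two percentile points and .21 SD to about eight; the four-to-16 range in the quoted post presumably rests on a different conversion.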

I thoroughly agree with the researchers’ last sentence. But I did have problems with the following assertions, supported by two studies they cited.

*That principals are responsible for 25 percent of student gains on test scores (teachers, the report says, account for an additional 33 percent of those higher test scores). I traced back the source they cited and found these statements:

A 2009 study by New Leaders for New Schools found that more than half of a school’s impact on student gains can be attributed to both principal and teacher effectiveness – with principals accounting for 25 percent and teachers 33 percent of the effect.

The report noted that schools making significant progress are often led by a principal whose role has been radically re-imagined. Not only is the principal attuned to classroom learning, but he or she is also able to create a climate of hard work and success while managing the vital human-capital pipeline.

These researchers do cite studies that support their points about principals and student achievement, but I could not find the exact study showing that principals account for 25 percent of gains in student test scores. Moreover, they omit studies of higher education programs whose graduates, as principals, have raised student test scores (see here).

I applaud these researchers for their efforts to improve the university training that principals receive, but there is a huge “black box” of unknowns about how principals account for improved student achievement. Opening that “black box” has been attempted in various studies that Jane David and I looked at a few years ago in Cutting through the Hype.

The research we reviewed on stable gains in test scores across many different approaches to school improvement all clearly points to the principal as the catalyst for instructional improvement. But being a catalyst does not identify which specific actions influence what teachers do or translate into improvements in teaching and student achievement.

Researchers find that what matters most is the context or climate in which the action occurs. For example, classroom visits, often called “walk-throughs,” are a popular vehicle for principals to observe what teachers are doing. Principals might walk into classrooms with a required checklist designed by the district and check off items, an approach likely to misfire. Or a principal might have a short list of expected classroom practices, created or adopted in collaboration with teachers in the context of specific school goals for achievement. The latter signals a context of collaboration and trust, within which an action by the principal is more likely to be influential than in a context of mistrust and fear.

So research does not point to specific sure-fire actions that instructional leaders can take to change teacher behavior and student learning. Instead, what’s clear from studies of schools that do improve is that a cluster of factors accounts for the change.

Over the past forty years, factors associated with raising a school’s academic profile include: teachers’ consistent focus on academic standards and frequent assessment of student learning, a serious school-wide climate toward learning, district support, and parental participation. Recent research also points to the importance of mobilizing teachers and the community to move in the same direction, building trust among all the players, and especially creating working conditions that support teacher collaboration and professional development.

In short, a principal’s instructional leadership combines direct actions, such as observing and evaluating teachers, and indirect actions, such as creating school conditions that foster improvements in teaching and learning. How principals do this varies from school to school–particularly between elementary and secondary schools, given their considerable differences in size, teacher preparation, daily schedule, and students’ plans for their future. Yes, keeping their eyes on instruction can contribute to stronger instruction and, yes, even to higher test scores. But close monitoring of instruction can only contribute to, not ensure, such improvement.

Moreover, learning to carry out this role as well as all the other duties of the job takes time and experience. Both of these are in short supply, especially in urban districts where principal turnover rates are high.

I am sure these university researchers are familiar with this literature. I wish them well in their efforts to pin down what principals do that accounts for test score improvement and to incorporate that knowledge into a program that shapes what their graduates do as principals in the schools they lead.




Filed under school leaders, testing

A Story about District Test Scores

This story is not about current classrooms and schools. Neither is it about coercive accountability, unrealistic curriculum standards, or the narrowness of highly prized tests in judging district quality. This story takes place well before Race to the Top, Adequate Yearly Progress, and “growth scores” entered educators’ vocabulary.

The story is about a district over 40 years ago that scored one point above comparable districts on a single test and what occurred as a result. There are two lessons buried in this story–yes, here’s the spoiler. First, public perceptions of standardized test scores as a marker of “success” in schooling have a long history of being far more powerful than observers have believed, and, second, the importance of students scoring well on key tests predates A Nation at Risk (1983), the Comprehensive School Reform Act (1998), and No Child Left Behind (2002).


I was superintendent of the Arlington (VA) public schools between 1974 and 1981. In 1979 something happened that both startled me and gave me insight into the public power of test scores. The larger lesson, however, came years after I left the superintendency, when I began to understand the potent drive that everyone has to explain something, anything, by supplying a cause, any cause, just to make sense of what occurred.

In Arlington then, the school board and I were responsible for a district that had declined in population (from 20,000 students to 15,000) and had become increasingly minority (from 15 percent to 30 percent). The public sense that the district was in free-fall, we felt, could be arrested by concentrating on academic achievement, critical thinking, expanding the humanities, and improved teaching. After five years, both the board and I felt we were making progress.

State test scores–the coin of the realm in Arlington–at the elementary level climbed consistently each year. The bar charts I presented at press conferences looked like a stairway to the stars and thrilled school board members. When scores were published in local papers, I would admonish the school board to keep in mind that these scores were a very narrow part of what occurred daily in district schools. Moreover, while scores were helpful in identifying problems, they were severely inadequate in assessing individual students and teachers. My admonitions were generally swept aside, gleefully I might add, when scores rose and were printed school-by-school in newspapers. This hunger for numbers left me deeply skeptical about standardized test scores as signs of district effectiveness.

Then along came a Washington Post article in 1979 that showed Arlington to have edged out Fairfax County, an adjacent and far larger district, as having the highest Scholastic Aptitude Test (SAT) scores among eight districts in the metropolitan area (yeah, I know it was by one point, but when test scores determine winners and losers as in horse-races, Arlington had won by a nose).

I knew that SAT results had nothing whatsoever to do with how our schools performed. It was a national standardized instrument to predict college performance of individual students; it was not constructed to assess district effectiveness. I also knew that the test had little to do with what Arlington teachers taught. I told that to the school board publicly and anyone else who asked about the SATs. Few listened.

Nonetheless, the Post article with the box-score of test results produced more personal praise, more testimonials to my effectiveness as a superintendent, and, I believe, more acceptance of the school board’s policies than any single act during the seven years I served. People saw the actions of the Arlington school board and superintendent as having caused those SAT scores to outstrip other Washington area districts.

The lessons I learned in 1979 are, first, that public perceptions of high-value markers of “quality”–in this instance, test scores–shape the concrete realities that policymakers such as a school board and superintendent face in making budgetary, curricular, and organizational decisions. Second, as a historian of education I learned that using test scores to judge a district’s “success” began in the late 1960s, when newspapers began publishing district and school-by-school test scores, pre-dating by decades the surge of such reporting in the 1980s and 1990s.

This story and its lessons I have never forgotten.



Filed under leadership, testing

Don’t Grade Schools on Grit (Angela Duckworth)

Angela Duckworth is the founder and scientific director of the Character Lab, a professor of psychology at the University of Pennsylvania, and the author of the forthcoming book “Grit: The Power of Passion and Perseverance.” This op-ed appeared in the New York Times, March 26, 2016.


The Rev. Dr. Martin Luther King Jr. once observed, “Intelligence plus character — that is the goal of true education.”

Evidence has now accumulated in support of King’s proposition: Attributes like self-control predict children’s success in school and beyond. Over the past few years, I’ve seen a groundswell of popular interest in character development.

As a social scientist researching the importance of character, I was heartened. It seemed that the narrow focus on standardized achievement test scores from the years I taught in public schools was giving way to a broader, more enlightened perspective.

These days, however, I worry I’ve contributed, inadvertently, to an idea I vigorously oppose: high-stakes character assessment. New federal legislation can be interpreted as encouraging states and schools to incorporate measures of character into their accountability systems. This year, nine California school districts will begin doing this.

Here’s how it all started. A decade ago, in my final year of graduate school, I met two educators, Dave Levin, of the KIPP charter school network, and Dominic Randolph, of Riverdale Country School. Though they served students at opposite ends of the socioeconomic spectrum, both understood the importance of character development. They came to me because they wanted to provide feedback to kids on character strengths. Feedback is fundamental, they reasoned, because it’s hard to improve what you can’t measure.

This wasn’t entirely a new idea. Students have long received grades for behavior-related categories like citizenship or conduct. But an omnibus rating implies that character is singular when, in fact, it is plural.

In data collected on thousands of students from district, charter and independent schools, I’ve identified three correlated but distinct clusters of character strengths. One includes strengths like grit, self-control and optimism. They help you achieve your goals. The second includes social intelligence and gratitude; these strengths help you relate to, and help, other people. The third includes curiosity, open-mindedness and zest for learning, which enable independent thinking.

Still, separating character into specific strengths doesn’t go far enough. As a teacher, I had a habit of entreating students to “use some self-control, please!” Such abstract exhortations rarely worked. My students didn’t know what, specifically, I wanted them to do.

In designing what we called a Character Growth Card — a simple questionnaire that generates numeric scores for character strengths in a given marking period — Mr. Levin, Mr. Randolph and I hoped to provide students with feedback that pinpointed specific behaviors.

For instance, the character strength of self-control is assessed by questions about whether students “came to class prepared” and “allowed others to speak without interrupting”; gratitude, by items like “did something nice for someone else as a way of saying thank you.” The frequency of these observed behaviors is estimated using a seven-point scale from “almost never” to “almost always.”

Most students and parents said this feedback was useful. But it was still falling short. Getting feedback is one thing, and listening to it is another.

To encourage self-reflection, we asked students to rate themselves. Thinking you’re “almost always” paying attention but seeing that your teachers say this happens only “sometimes” was often the wake-up call students needed.

This model still has many shortcomings. Some teachers say students would benefit from more frequent feedback. Others have suggested that scores should be replaced by written narratives. Most important, we’ve discovered that feedback is insufficient. If a student struggles with “demonstrating respect for the feelings of others,” for example, raising awareness of this problem isn’t enough. That student needs strategies for what to do differently. His teachers and parents also need guidance in how to help him.

Scientists and educators are working together to discover more effective ways of cultivating character. For example, research has shown that we can teach children the self-control strategy of setting goals and making plans, with measurable benefits for academic achievement. It’s also possible to help children manage their emotions and to develop a “growth mind-set” about learning (that is, believing that their abilities are malleable rather than fixed).

This is exciting progress. A 2011 meta-analysis of more than 200 school-based programs found that teaching social and emotional skills can improve behavior and raise academic achievement, strong evidence that school is an important arena for the development of character.

But we’re nowhere near ready — and perhaps never will be — to use feedback on character as a metric for judging the effectiveness of teachers and schools. We shouldn’t be rewarding or punishing schools for how students perform on these measures.

My concerns stem from intimate acquaintance with the limitations of the measures themselves.

One problem is reference bias: A judgment about whether you “came to class prepared” depends on your frame of reference. If you consider being prepared to mean arriving before the bell rings, with your notebook open, last night’s homework complete, and your full attention turned toward the day’s lesson, you might rate yourself lower than a less prepared student with more lax standards.

For instance, in a study of self-reported conscientiousness in 56 countries, it was the Japanese, Chinese and Korean respondents who rated themselves lowest. The authors of the study speculated that this reflected differences in cultural norms, rather than in actual behavior.

Comparisons between American schools often produce similarly paradoxical findings. In a study colleagues and I published last year, we found that eighth graders at high-performing charter schools gave themselves lower scores on conscientiousness, self-control and grit than their counterparts at district schools. This was perhaps because students at these charter schools held themselves to higher standards.

I also worry that tying external rewards and punishments to character assessment will create incentives for cheating. Policy makers who assume that giving educators and students more reasons to care about character can be only a good thing should take heed of research suggesting that extrinsic motivation can, in fact, displace intrinsic motivation. While carrots and sticks can bring about short-term changes in behavior, they often undermine interest in and responsibility for the behavior itself.

A couple of weeks ago, a colleague told me that she’d heard from a teacher in one of the California school districts adopting the new character test. The teacher was unsettled that questionnaires her students filled out about their grit and growth mind-set would contribute to an evaluation of her school’s quality. I felt queasy. This was not at all my intent, and this is not at all a good idea.

Does character matter, and can character be developed? Science and experience unequivocally say yes. Can the practice of giving feedback to students on character be improved? Absolutely. Can scientists and educators work together to cultivate students’ character? Without question.

Should we turn measures of character intended for research and self-discovery into high-stakes metrics for accountability? In my view, no.


Filed under testing

Why Common Core Standards Will Succeed

Even though there is little evidence that state standards have increased student academic achievement since the 1980s, the District of Columbia and 45 states have embraced the Common Core–(see here and here).

Even though countries with national standards do not necessarily score higher on international tests than nations without them, many states have already aligned their standards to textbooks, lessons, and tests–(see here and here).

Even though there is little evidence Common Core standards will produce the skilled and knowledgeable graduates that employers and college teachers have demanded of public schools, most state and federal officials have assured parents and taxpayers that the new standards and tests will do exactly that–(see here and here).

Even though there is little evidence that state and national officials have resolved tough issues in the past when it came to curriculum standards (e.g., supplying professional development for teachers and principals, providing appropriate instructional materials, determining whether teachers altered their practices), much less reduced the inevitable problems that will occur in implementing the Common Core standards (e.g., resources for computer-based testing), cheerleaders continue to beat the drums for national standards–(see here and here).



With all of these “even though”s (and there are more), Common Core standards will succeed. How can that be?

The short answer is that evidence of success doesn’t matter much to those who make policy decisions. Oh sure, decision-makers have to mention evidence and research studies, and they do, but not much when it comes to Common Core standards. Instead, what they talk about are failing schools, the low quality of teaching, and how, unless academic standards are raised–drum roll here at the mention of Common Core–the economy will sink under the weight of graduates unprepared for an information-based workplace. Getting everyone to go to college, especially minority and poor students, is somehow seen as a solution to the economic, political, and social inequalities that have persistently plagued the U.S. for the past four decades.

Reform-minded policy elites–top federal and state officials, business leaders, and their entourages with unlimited access to media (e.g., television, websites, print journalism)–use these talking points to engage the emotions and, of course, spotlight public schools as the reasons why the U.S. is not as globally competitive as it should be. By focusing on the Common Core, charter schools, and evaluating teachers on the basis of student test scores, these decision-makers have shifted public attention away from fiscal and tax policies and economic structures that not only deepen and sustain poverty in society but also reinforce privilege of the top two percent of wealthy Americans. Policy elites have banged away unrelentingly at public schools as the source of national woes for decades.

National, state, and local opinion-makers in the business of school reform know that what matters is not evidence, not research studies, not past experiences with similar reforms–what matters is the appearance of success. Success is 45 states adopting standards, national tests taken by millions of students, and public acceptance of Common Core. Projecting positive images (e.g., the film Waiting for Superman, “everyone goes to college”) and pushing myths (e.g., U.S. schools are broken, schools are an arm of the economy)–that is what counts in the theater of school reform.

Within a few years–say, by 2016, a presidential election year–policy elites will declare the new standards a “success” and, hold onto your hats, introduce more and better standards and tests.

This happened before with minimum competency tests in the 1970s. By 1980, thirty-seven states had mandated these tests for grade-to-grade promotion and high school graduation. A Nation at Risk (1983) judged these tests too easy since most students passed them. So goodbye to competency tests. The same thing happened again in the 1990s with the launching of upgraded state curriculum standards (e.g., Massachusetts); then NCLB and later Common Core came along. It is happening now and will happen again.

Policy elites see school reform as a form of theater. Blaming schools for serious national problems, saying the right emotionally-loaded words, and giving the appearance of doing mighty things to solve the “school” problem matter far more than hard evidence or past experiences with similar reforms.


Filed under school reform policies, testing

Buying iPads, Common Core Standards, and Computer-Based Testing

The tsunami of computer-based testing for public school students is on the horizon. Get ready.

For adults, computer-based testing has been around for decades. For example, I have taken and re-taken the California online test to renew my driver’s license twice in the past decade. To get certified as a volunteer driver for Packard Children’s Hospital in Palo Alto, I had to read gobs of material about hospital policies and federal regulations on confidentiality before taking a series of computer-based tests. To obtain approval from Stanford University for a research project on which I am the principal investigator, and for which I would interview teachers and observe classrooms, I had to read online a massive amount of material on university regulations about subjects’ consent to participate, confidentiality, and the handling of information obtained from interviews and classroom observations. And again, I took online tests that I had to pass in order to gain the university’s approval to conduct research. Beyond the California Department of Motor Vehicles, Children’s Hospital, and Stanford University, online assessment has been a staple in the business sector from hiring through employee evaluations. So online testing is already part of adult experience.

What about K-12 students? Increasingly, districts are adopting computer-based testing. For example, Measures of Academic Progress, a popular test used in many districts, is online. Speeding up this adoption are the Common Core standards and the two consortia that are preparing assessments for the 45 states on the cusp of implementing the standards. Many states have already mandated online testing for their own standardized tests to prepare for the impending national assessments. These tests will require students to have access to a computer with the right hardware, software, and bandwidth to accommodate online testing by 2014-2015 (see here, here, and here).

There are many pros and cons of online testing compared with, say, paper-and-pencil tests. But whatever the pros of paper-and-pencil tests, they are being outslugged and outstripped by the surge of buying new devices and piloting computer-based tests to get ready for Common Core assessments (see here and here). Los Angeles Unified School District, the second largest in the nation, just signed a $50 million contract with Apple for iPads. One of the key reasons for buying these devices for the initial rollout in 47 schools was Common Core standards and assessment. Each iPad comes with an array of pre-loaded software compatible with the state online testing system and impending national assessments. The entire effort is called The Common Core Technology Project.

The best (and most recent) gift to the hardware and software industry has been the Common Core standards and assessments. At a time of fiscal retrenchment in school districts across the country when schools are being closed and teachers are let go, many districts have found the funds to go on shopping sprees to get ready for the Common Core.

And here is the point that I want to make. The old reasons for buying technology have been shunted aside for a sparkling new one. Consider that for the past three decades the rationale for buying desktop computers, laptops, and now tablets has been three-fold:

1. Make schools more efficient and productive so that students learn more, faster, and better than they had before.

2. Transform teaching and learning into an engaging and active process connected to real life.

3. Prepare the current generation of young people for the future workplace.

After three decades of rhetoric and research, teachers, principals, students, and vendors have their favorite tales to prove that these goals have been achieved. But for those who want more than gee-whiz stories, who seek a reliable body of evidence showing that students learn more, faster, and better, that teaching and learning have been transformed, and that using these devices has prepared the current generation for actual jobs—well, that body of evidence is missing for each of these traditional reasons to buy computers.

With Common Core standards adopted, the rationale for getting devices has shifted. No longer does it matter whether there is sufficient evidence to justify huge expenditures on new technologies. Now, what matters are the practical problems of being technologically ready for the new standards and tests in 2014-2015: getting more hardware, software, additional bandwidth, technical assistance, and professional development for teachers, and finding time in the school day to let students practice taking tests.

Whether the Common Core standards will improve student achievement–however measured–and whether students learn more, faster, and better–none of this matters in deciding which vendor to use. The question is no longer whether to buy; it is how much money we have and when we can get the devices. That is the tidal wave on the horizon.


Filed under technology, testing

Cheating Scandals Reaffirm, Not Diminish, Testing

Not until the trials (or plea bargains) are over will a verdict be rendered on former Superintendent Beverly Hall’s guilt or innocence in what is called the Atlanta cheating scandal. Hall’s indictment follows on the heels of the conviction of El Paso Superintendent Lorenzo Garcia last fall. He is now serving three and a half years in jail (see here and here).

Even before a judge or jury decides on her guilt or innocence, anti-testing groups, feeding on Atlanta, El Paso, and the investigation of tampering with test scores under Washington, D.C., school chief Michelle Rhee, have seized on these cases to further their cause. Moreover, over the years, journalists have uncovered oddities in other districts across the nation–test scores jumping sky-high in a single year.

Foes of standardized tests feel the rush of adrenaline in saying that these examples of dishonest adults raising student test scores to receive applause and cash awards are pervasive. Defenders of standardized testing and accountability, however, see the cheating as exceptions–a few rotten apples in a barrel full of worm-free ones. Most educators, advocates of test-driven accountability say, are decent, hardworking professionals who play by the rules and can be trusted to do the right thing.

In this volleying back-and-forth between advocates and foes of standardized testing, school scandals have been compared to cheating in baseball, bicycle racing, and other sports.

From Mark McGwire’s stained home run record to Tour de France winner Lance Armstrong’s admission that he doped while racing, these and other sports have come under a dark cloud of suspicion–an outcome damaging to top athletes, to companies dependent upon income derived from professional sports, to fans turning into cynics, and to disappointed youth who only want to play the game by the rules.

Cheating in both sports and schools can be traced to unleashed and fierce competition to perform better and better for ever-larger rewards. Professional sports are money machines, and being a top performer is rewarded handsomely; scores on international tests, rankings of schools within a state and district based on performance, a broader array of school choices, and federal regulations in No Child Left Behind and Race to the Top have ratcheted up the pressure to beat state tests.

Also common to school cheating and drug-drenched sports is betraying the public trust for personal advantage. When adults erase student answers and professional athletes take illegal drugs to enhance performance, such acts erode the faith that adults and youth have in the fairness of social institutions.

Another common feature is the unshaken confidence that current authorities have in written and computerized tests assessing student learning and drug tests determining whether athletes are cheating. When cheating is uncovered, few decision-makers question the tests. Tighter security and better tests are the solutions.

*Few decision-makers question whether there might be something wrong in professional athletics (e.g., the expansion of baseball, football, hockey, and basketball leagues and over-the-top competition for more money).

*Few decision-makers question whether most toddlers and young children from low-income families should be tested at all, especially since they bring to school very different strengths and weaknesses than children from middle- and upper-income homes. Or whether such early testing squeezes inequities into judgments of what young children can and cannot do in preschool and elementary school classrooms.

*Few decision-makers question the national obsession with student test scores as the correct metric to judge schools, teachers, and students.

This deep reluctance to question powerful interests invested in socioeconomic structures and cultures in which cheating occurs is why I believe that standardized tests in schools, like drug testing in sports, will be reaffirmed rather than overturned. There will be continuing challenges–as there should be–but standardized testing will remain rock-solid. Why?

First, note that most of the cheating incidents have occurred in districts where high percentages of poor and minority students attend school. Sure, there are exceptions, but when you look closely at where dishonesty is found, those charters and regular public schools enroll large numbers of children from low-income families. I have yet to find any district school board, investigator, charter school leader, or policymaker recommending that the tests be examined to see if they do what they are supposed to do or, after conducting such an examination, finding the tests unworthy and getting rid of them. Yes, there have been protests by educators, students, and middle- and upper-middle-class families against too much standardized testing (see here and here). These protests have led to occasional boycotts, but none, to my knowledge, have occurred in poor neighborhoods. If anything, there is a reaffirmation of tests, calls for greater security, and plaudits for any whistle-blowers.

The point is that these tests sort students and schools by scores that reinforce rather than erase existing gaps in achievement. And sorting is necessary to determine who, beginning at the age of four, shall climb each rung of that ladder reaching college. The system of private and public schooling requires such tests to distinguish high achievers from others. If the tests really were that accurate in distinguishing which children and youth are smart on paper, with people, and in life now and later, then perhaps we would need such tests. But that is not the case now… not by a long shot.

Second, to underscore the above point, consider the experience of cheating on the SAT. After a scandal revealed that high-scoring individuals with fake IDs were being paid to take the SAT, the Educational Testing Service tightened security at test sites. No challenges to the test itself occurred. SAT scores remain crucial for college admission, and no school boards, teachers, or parent groups called for the end of the test.

Count on cheaters getting more clever and investigators still hunting them down. Amid increasing numbers of cheating incidents, standardized tests will be challenged, maybe the numbers even reduced, but nonetheless, they will reign for the immediate future.


Filed under leadership, school reform policies, testing

From Quill Pens to Computer Adaptive Testing: Old and New Technological Devices

There are many definitions of instructional technology. One concentrates on the devices teachers use in classrooms. Another focuses on the different ways that teachers have used such devices as tools to advance learning in lessons. Still other definitions frame technology as processes–ways of organizing classrooms, schools, and districts.

I examine the second definition in this post: the connection between writing tools students used and the perpetual demand over the past two millennia of teachers in every culture to find out what students have learned. Here I consider the quill and steel-tipped pen, pencil, ball-point pen, and yes, the computer.

I begin with the quill pen.


Here is how Robert Travers (1983, pp. 97-98) described quill pens.

The quill pen was first mentioned in the writings of Saint Isidore of Seville in the seventh century…. The quill seems to have been by far the best writing instrument invented in its time for it displaces all other forms. It became the main instrument used in schools, apart from the slate…. Even in the late 1800s, the quill pen was still the most widely used instrument for writing. Quills [came] from the wings of geese, but swan quills were also highly valued. For fine work, the quills of crows were sometimes used.

In 1809, an inventor, Joseph Bramah, developed a machine for cutting quills into lengths, and the short lengths were then inserted into a wooden holder….The separation of the point and the holder led to many inventions, and one of these was the [metal-tipped pen]….The first factory for the mass production of the steel pen was established in New Jersey in 1870….

When the steel pen entered education, a revolution in school practice [occurred]. Writing with the quill had been a slow, unhurried art…. [T]he writer had to stop frequently in order to reshape and sharpen the quill. Since writing was a slow art, pride was taken in it….The steel pen changed that. The steel pen made it possible to write continuously over long periods. There was ever increasing pressure on the pupil to produce written material in quantity. The new medium for written work then became used for examinations, which became substitutes for the form of oral examination provided by the recitation [where students would be quizzed in public for their knowledge]….

By 1890, students had become so used to the steel pen that examinations were commonly administered using this writing instrument as a tool to produce rapidly written answers.


As an elementary school student in 1940 at the now-demolished Minersville school in the Pittsburgh (PA) public schools, I sat at one of the desks pictured above, with its hole for the then-defunct inkwell.

What about pencils? Like metal-tipped pens, mass-produced pencils did not appear in most classrooms until the early decades of the 20th century. And with pencils, teachers assessed what students learned through hand-written homework, quizzes, essays, and multiple-choice tests (introduced in the U.S. during World War I). These cheap devices (mass-produced ball-point pens arrived in schools in the 1940s) made assessing students' knowledge inexpensive–after all, no one pays students to take tests or for time lost to learning–and efficient in judging promotion, retention, graduation, and other high-stakes outcomes.

Now arrives computer adaptive testing (CAT). Used a great deal in the private sector for employment and other purposes, computerized testing has entered schools over the past few decades. In Measures of Academic Progress (MAP), for example, students sit in front of computer screens and take tests tailored to their ability. When a student answers an item (usually multiple-choice) correctly, the student is given a harder item. If the student gives a wrong answer, the screen shows an easier question. This continues until the item bank runs out of questions or the computer has enough information to assign the student a score; whichever happens first ends the test.
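The up-and-down loop just described can be sketched in a few lines of code. This is a toy illustration, not any vendor's actual algorithm: the difficulty levels, starting point, and stopping rules are invented for the example, and real CAT systems like MAP rely on item-response-theory models rather than a simple step rule.

```python
# Toy sketch of an adaptive item-selection loop (NOT any vendor's
# actual algorithm). The item bank is a sorted list of difficulty
# levels; the test moves up after a right answer, down after a wrong one.

def adaptive_test(answer_item, item_bank, max_items=5):
    """answer_item(difficulty) -> bool simulates the student's response."""
    level = len(item_bank) // 2       # start in the middle of the bank
    administered = 0
    score = 0
    # Stop when items run out or enough responses have been gathered.
    while administered < max_items and 0 <= level < len(item_bank):
        administered += 1
        if answer_item(item_bank[level]):
            score = item_bank[level]  # credit the hardest item passed
            level += 1                # serve a harder item next
        else:
            level -= 1                # serve an easier item next
    return score

# A student who can answer anything at difficulty 3 or below:
print(adaptive_test(lambda d: d <= 3, [1, 2, 3, 4, 5]))  # 3
```

The test quickly homes in on the level where the student starts missing items, which is why adaptive tests can score students with far fewer questions than a fixed-form test.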

Highly touted by promoters and vendors–see the McGraw-Hill YouTube segment for an example of the hype–CAT is part of the package that the new national tests accompanying the Common Core standards will include by 2014. As with any new technological device, there are clear advantages and disadvantages to this form of assessment (see Computer Adaptive Testing).

Like quill and steel-tipped pens dipped in ink, pencils, and ballpoint pens, here is another technological device being bent toward finding out what students know. Ideally, of course, there would be no need for CAT or the mountain of summative tests currently in vogue across the country were the nation's teachers sufficiently trusted to use the many ways they assess daily what their students know and can do–and were districts to invest in building teachers' knowledge and skills in assessment. That kind of investment in teacher knowledge and skills, and the accompanying trust in teachers and schools to assess and report the results, is, sad to say, missing in action.

So watch computer adaptive testing become the new steel-tipped pen of the late-19th century.


Filed under how teachers teach, technology use, testing

Chicago Teachers’ Strike, Performance Evaluation, and School Reform (Jack Schneider and Ethan Hutt)

For the fifth consecutive day in Chicago, nearly 30,000 teachers are out on strike. At issue are many of the contractual details that typically complicate collective bargaining—pay, benefits, and the length of the work day. But the heart of the dispute roiling the Windy City is something relatively new in contract talks: a complicated statistical algorithm.

District leaders in Chicago, following the lead of reformers in cities nationwide, are pushing for a “value-added” evaluation system. Unlike traditional forms of evaluation, which rely primarily on classroom observations, policymakers in Chicago propose to quantify teacher quality through the analysis of student achievement data. Using cutting-edge statistical methodologies to analyze standardized test scores, the district would determine the value “added” by each teacher and use that information as a basis for making personnel decisions.
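To make the basic idea concrete, here is a deliberately stripped-down sketch of a "value-added" calculation: each teacher's score is the average gain of his or her students relative to the district-wide average gain. The rosters and scores are invented, and actual value-added models are far more elaborate (controlling for prior achievement, demographics, and school effects), so this illustrates only the underlying logic, not Chicago's proposed system.

```python
# Deliberately simplified "value-added" sketch with invented data:
# a teacher's score is the mean gain of his or her students minus the
# district-wide mean gain. Real models add many statistical controls.

district = {
    "Teacher A": [(40, 55), (50, 62), (45, 58)],  # (fall, spring) pairs
    "Teacher B": [(80, 84), (75, 80), (90, 92)],
}

def mean(xs):
    return sum(xs) / len(xs)

all_gains = [spring - fall
             for roster in district.values()
             for fall, spring in roster]
district_gain = mean(all_gains)  # 8.5 with these numbers

for teacher, roster in district.items():
    gains = [spring - fall for fall, spring in roster]
    value_added = mean(gains) - district_gain
    print(teacher, round(value_added, 2))  # A: 4.83, B: -4.83
```

Note that Teacher B's students have much higher scores, yet B's value-added is negative; the whole point of the approach is to reward gains rather than levels, which is also where its statistical fragility lies.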

Teachers are opposed to this approach for a number of reasons. But educational researchers are generally opposed to it, too, and their reasoning is far less varied: value-added evaluation is unreliable.

As researchers have shown, value-added methodologies are still very much works in progress. Scholars like Heather Hill have found that teachers' value-added scores correlate not only with the quality of their instruction but also with the population of students they teach. Researchers examining schools in Palm Beach, Florida, discovered that more than 40 percent of teachers scoring in the bottom decile one year, according to value-added measurements, somehow scored in the top two deciles the following year. And according to a recent Mathematica study, the error rate for comparing teacher performance was 35 percent. Such figures could only inspire confidence among those working to suspend disbelief.

And yet suspending disbelief is exactly what reformers are doing. Instead of slowing down the push for value-added, they’re plowing full steam ahead. Why?

The promise of a mechanized quality-control process, it turns out, has long captivated education reformers. And while the statistical algorithm in question right now in Chicago happens to be quite new, reformer obsession with ostensibly standardized, objective, and efficient means of gauging value is, in fact, quite old. Unfortunately, as the past reveals, plunging headlong into a cutting-edge measurement technology is also quite problematic.

Example 1:
Nearly a century ago, school leaders saw a breakthrough in measurement technology as a way of gauging teacher quality. By using newly designed IQ tests to assess "native ability," school administrators could translate student scores on standardized tests into measures of teacher effectiveness. Of course, not everyone was on board with this effort. As one school superintendent noted, some educators were concerned "that the means for making quantitative and qualitative measures of the school product" were "too limited to provide an adequate basis for judgment." But the promise of the future was too tempting and, as he argued, though it was "impossible" to measure teacher quality rigorously, "a good beginning" had been made. Reformers plowed ahead.

The IQ movement was deeply flawed. The instruments were faulty and culturally-biased. The methodology was inconsistent and poorly applied. And the interpretations were horrifying. “If both parents are feeble-minded all the children will be feeble-minded,” wrote H.H. Goddard in 1914. “Such matings,” he reasoned, “should not be allowed.” Others drew equally shocking conclusions. E.G. Boring, a distinguished psychologist of the period, wrote in 1920 that “the average man of many nations is a moron.” The average Italian-American adult, he calculated, had a mental age of 11.01 years. African-Americans were at the bottom of his list, with an average mental age of 10.41.

Value-added proponents like to make the argument that "some data is better than no data." Yet in the case of the mental testing movement, that was patently false. For the hundreds of thousands of students tracked into dead-end curricula, to say nothing of the forced sterilization campaigns that took place outside of schools, reform was imprudent and irresponsible.

But one need not go back so far into the educational past for examples of half-baked quality-control reforms peddled by zealous policymakers.

Example 2:
In the 1970s, 37 states hurriedly adopted “minimum competency testing” legislation and implemented “exit examinations,” ignoring the concerns of experts in the field. As one panel of scholars observed, the plan was “basically unworkable” and exceeded “the present measurement arts of the teaching profession.” Reformers, however, were not easily dissuaded.

The result of the minimum competency movement was the development of a high stakes accountability regime and years of litigation. Reformers claimed that the information revealed by such tests would provide the sunlight and shame that schools needed to improve. Yet while they awaited that outcome, thousands of students suffered the indignity of being labeled “functionally illiterate,” were forced into remedial classes, and had their diplomas withheld despite having enough units to graduate—all on the basis of a test that leading scholars described as “an indefensible technology.”

Contrary to what reformers claimed, the information provided by such deeply flawed tests did little to improve students’ learning or the quality of schools.

Today’s policymakers, like those of the past, want to adopt new tools as swiftly as possible. Even flawed value-added measures, they argue, are better than nothing. Yet the risks of early adoption, as the past reveals, can far outweigh the rewards. Simply put, acting rashly on incomplete information makes mistakes more costly than necessary.

Today’s value-added boosters believe themselves to be somehow different—acting on better incomplete information. Yet the idea that incomplete information can be good strains credulity.

Good technologies do tend to improve over time. And if advocates of value-added models are confident that they can work out the kinks, they should continue to experiment with them judiciously. In the meantime, however, such models should be kept out of all high-stakes personnel decisions. Until we can make them work well enough, they shouldn't count.


Jack Schneider is an Assistant Professor of Education at the College of the Holy Cross and author of Excellence For All: How a New Breed of Reformers Is Transforming America’s Public Schools. Ethan Hutt is a doctoral candidate at the Stanford University School of Education and has been named “one of the world’s best emerging social entrepreneurs” by the Echoing Green Foundation.


Filed under how teachers teach, school reform policies, testing

Testing, Testing, and Testing: More Cartoons

The U.S. has tests galore. Driving, alcohol, steroids, DNA, citizenship, blood, pregnancy–and on and on. Most serve a specific purpose and carry personal consequences if one passes or fails. School tests, however–to pass a course, to be promoted to the next grade, to graduate, and to judge whether a school is satisfactory or on probation–have proliferated dramatically in the past three decades. Americans' opinions about these tests are split.

Surveys report that most teachers (but by no means all) believe that there is too much standardized testing. Some parents have mobilized to boycott annual tests. Most respondents to opinion polls, however, support curriculum standards, accountability, and, yes, state tests.

Of the many cartoons on testing that I have located, most reflect the opinion that there is too much testing and too much is made of the results. I have found very few–none that I can recall or that I have posted–endorsing standardized tests. Here is a sampling of those cartoons.

For those readers who wish to see previous monthly posts of cartoons, see: "Digital Kids in School," "Testing," "Blaming Is So American," "Accountability in Action," "Charter Schools," "Age-graded Schools," "Students and Teachers," "Parent-Teacher Conferences," "Digital Teachers," and "Addiction to Electronic Devices."


Filed under testing

Three Important Distinctions In How We Talk About Test Scores (Matt DiCarlo)

“Matthew Di Carlo is a senior fellow at the non-profit Albert Shanker Institute in Washington, D.C. His current research focuses mostly on education policy, but he is also interested in social stratification, work and occupations, and political attitudes/behavior.” This post appeared on May 25, 2012.

In education discussions and articles, people (myself included) often say “achievement” when referring to test scores, or “student learning” when talking about changes in those scores. These words reflect implicit judgments to some degree (e.g., that the test scores actually measure learning or achievement). Every once in a while, it’s useful to remind ourselves that scores from even the best student assessments are imperfect measures of learning. But this is so widely understood – certainly in the education policy world, and I would say among the public as well – that the euphemisms are generally tolerated.

And then there are a few common terms or phrases that, in my personal opinion, are not so harmless. I’d like to quickly discuss three of them (all of which I’ve talked about before). All three appear many times every day in newspapers, blogs, and regular discussions. To criticize their use may seem like semantic nitpicking to some people, but I would argue that these distinctions are substantively important and may not be so widely-acknowledged, especially among people who aren’t heavily engaged in education policy (e.g., average newspaper readers).

So, here they are, in no particular order.

In virtually all public testing data, trends in performance are not “gains” or “progress.” When you tell the public that a school or district’s students made “gains” or “progress,” you’re clearly implying that there was improvement. But you can’t measure improvement unless you have at least two data points for the same students – i.e., test scores in one year are compared with those in previous years. If you’re tracking the average height of your tomato plants, and the shortest one dies overnight, you wouldn’t say that there had been “progress” or “gains,” just because the average height of your plants suddenly increased.

Similarly, almost all testing trend data available to the public don't actually follow the same set of students over time (i.e., they are cross-sectional). In some cases, such as NAEP, you're comparing a sample of fourth and eighth graders in one year with a different cohort of fourth and eighth graders two years earlier. In other cases, such as the results of state tests across an entire school, there's more overlap – many students remain in the sample between years – but there's also a lot of churn. In addition to student mobility within and across districts, which is often high and certainly non-random, students at the highest tested grade leave the schools (unless they're held back), while whole new cohorts of students enter the samples at the lowest tested grade (in middle schools serving grades seven and eight, this means that half the sample turns over every year).

So, whether it’s NAEP or state tests, you’re comparing two different groups of students over time. Often, those differences cannot be captured by standard education variables (e.g., lunch program eligibility), but are large enough to affect the results, especially in smaller schools (smaller samples are more prone to sampling error). Calling the differences between years “gains/progress” or “losses” therefore gives a false impression; at least in part, they are neither – reflecting nothing more than variations between the cohorts being compared.
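A toy calculation, with made-up numbers, shows how cohort turnover alone can produce an apparent "gain": every continuing student scores exactly the same in both years, yet the school average rises because a low-scoring cohort graduated and a higher-scoring one entered.

```python
# Made-up numbers: Year 1's 7th graders become Year 2's 8th graders
# with identical scores (no student improved at all), yet the school
# average rises because the cohorts turn over.

year1 = {"7th": [60, 62, 64], "8th": [50, 52, 54]}  # 8th graders graduate
year2 = {"8th": [60, 62, 64], "7th": [70, 72, 74]}  # new 7th cohort enters

def school_average(grades):
    scores = [s for cohort in grades.values() for s in cohort]
    return sum(scores) / len(scores)

print(school_average(year1))  # 57.0
print(school_average(year2))  # 67.0 -- a "gain" from churn, not learning
```

A ten-point jump with zero learning: exactly the tomato-plant problem, produced by nothing more than who entered and who left the sample.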

Proficiency rates are not “scores.” Proficiency or other cutpoint-based rates (e.g., percent advanced) are one huge step removed from test scores. They indicate how many students scored above a certain line. The choice of this line can be somewhat arbitrary, reflecting value judgments and, often, political considerations as to the definition of “proficient” or “advanced.” Without question, the rates are an accessible way to summarize the actual scale scores, which aren’t very meaningful to most people. But they are interpretations of scores, and severely limited ones at that.*

Rates can vary widely, using the exact same set of scores, depending on where the bar is set. In addition, all these rates tell you is whether students were above or below the designated line – not how far above it or below it they might be. Thus, the actual test scores of two groups of students might be very different even though they have the same proficiency ranking, and scores and rates can move in opposite directions between years.

To mitigate the risk of misinterpretation, comparisons of proficiency rates (whether between schools/districts or over time) should be accompanied by comparisons of average scale scores whenever possible. At the very least, the two should not be conflated.**
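A small numerical sketch (with invented scores) illustrates both points: the same set of scores yields very different rates depending on where the bar is set, and rates and average scores can move in opposite directions between years.

```python
# Invented scores. Same scores, different cutpoints -> different "rates";
# and rates can rise while the average score falls.

def proficiency_rate(scores, cut):
    return 100 * sum(s >= cut for s in scores) / len(scores)

def mean(scores):
    return sum(scores) / len(scores)

scores = [48, 49, 50, 51, 90]
print(proficiency_rate(scores, cut=50))  # 60.0 -- three of five pass
print(proficiency_rate(scores, cut=52))  # 20.0 -- same scores, higher bar

# Rates up, average down: two students inch over the bar while the
# top student's score collapses.
year1 = [48, 49, 90]   # rate ~33%, mean ~62.3
year2 = [51, 52, 60]   # rate 100%, mean ~54.3
print(proficiency_rate(year1, cut=50), mean(year1))
print(proficiency_rate(year2, cut=50), mean(year2))
```

Reported as a proficiency rate, the second year looks like a triumph; reported as an average score, it is a decline. That is why the two should accompany each other rather than be conflated.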

Schools with high average test scores are not necessarily “high-performing,” while schools with lower scores are not necessarily “low-performing.” As we all know, tests don’t measure the performance of schools. They measure (however imperfectly) the performance of students. One can of course use student performance to assess that of schools, but not with simple average scores.

Roughly speaking, you might define a high-performing school as one that provides high-quality instruction. Raw average test scores by themselves can't tell you about that, since the scores also reflect starting points over which schools have no control, and you can't separate the progress (school effect) from the starting points. For example, even the most effective school, providing the best instruction and generating large gains, might still have relatively low scores due to nothing more than the fact that the students it serves have low scores upon entry, and they only attend the school for a few years at most. Conversely, schools with very high scores might provide poor instruction, simply maintaining (or even decreasing) the already stellar performance levels of the students they serve.

We very clearly recognize this reality in how we evaluate teachers. We would never judge teachers’ performance based on how highly their students score at the end of the year, because some teachers’ students were higher-scoring than others’ at the beginning of the year.

Instead, to the degree that school (and teacher) effectiveness can be assessed using testing data, doing so requires growth measures, as these gauge (albeit imprecisely) whether students are making progress, independent of where they started out and other confounding factors. There’s a big difference between a high-performing school and a school that serves high-performing students; it’s important not to confuse them.
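A hypothetical example makes the distinction plain. "Growth" below is simply the mean fall-to-spring change, far cruder than real growth models, but it is enough to show how level and growth can rank the same two schools in opposite order.

```python
# Invented fall/spring scores for two hypothetical schools. "Growth"
# here is just mean change -- far cruder than real growth models.

schools = {
    "A": {"fall": [30, 35, 40], "spring": [45, 50, 55]},  # low scores, big gains
    "B": {"fall": [85, 90, 95], "spring": [83, 88, 93]},  # high scores, slipping
}

def mean(xs):
    return sum(xs) / len(xs)

for name, s in schools.items():
    level = mean(s["spring"])
    growth = mean(s["spring"]) - mean(s["fall"])
    print(name, level, growth)
# B "wins" on raw spring averages (88.0 vs. 50.0), but A is the school
# generating gains (+15.0 vs. -2.0).
```

School B serves high-performing students; School A is the high-performing school. Raw averages alone reward B, which is precisely the confusion the paragraph above warns against.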


* Although this doesn’t affect the point about the distinction between scores and rates, it’s fair to argue that scale scores also reflect value judgments and interpretations, as the process by which they are calculated is laden with assumptions – e.g., about the comparability of content on different tests.

** Average scores, of course, also have their strengths and weaknesses. Like all summary statistics, they hide a lot of the variation. And, unlike rates, they don’t provide much indication as to whether the score is “high” or “low” by some absolute standard (thus making them very difficult to interpret), and they are usually not comparable between grades. But they are a better measure of the performance of the “typical student,” and as such are critical for a more complete portrayal of testing results, especially viewed over time.


Filed under testing