Assessing Teachers

The Death and Life of the Great American School System
by Diane Ravitch

(Basic Books, 2010)

Reviewed by Sarah Schumacher
Secondary Literacy & Social Studies (WA)

TLN New Millennium Initiative

Why would a powerful, successful advocate for what amounted to a revolution in our education system completely change her mind about the initiatives she once supported? And what happens when she does? What do we do next? Those were the questions on my mind as I picked up Diane Ravitch’s newest book The Death and Life of the Great American School System. I had heard of Diane Ravitch and her ideas many times in my career and had heard rumblings about her so-called ‘mea culpa’, and so was excited to find out what her motivations were, what she had seen that so completely changed her mind.

There are no sacred cows in this book; Ravitch pulls no punches. She systematically goes after many of the initiatives and policies that have been held up in recent years as cure-alls for the ills of our education system: testing, tenure, charter schools, Teach for America, vouchers... the list goes on. As she says late in the book, “American education has a long history of infatuation with fads and ill-considered ideas.” As educators, we see so many initiatives roll through, promised as panaceas, that we often know will lead to nothing. So it’s refreshing to hear someone speaking so candidly and with so much depth. On the other hand, as you read the book you’re left wondering: what else is there? If these things aren’t “magic feathers,” then what should we be doing instead? We definitely learn the ‘why’ of her transformation, but not so much the ‘what next?’

Her first chapter, “What I Learned About School Reform,” outlines Ravitch’s career as an educational researcher and writer and her subsequent rise to the position of assistant secretary of education in the George H.W. Bush administration. One thing I respected immediately about her arguments is that she doesn’t let herself off the hook for the role she played. She admits, “I began ‘seeing like a state,’ looking at schools and teachers and students from an altitude of 20,000 feet and seeing them as objects to be moved around by big ideas and great plans.” The chapter then chronicles her change of heart as she realizes the initiatives proliferating in education are not getting the results they should. She ends by opening her argument that, in the era of NCLB, education was beginning to be viewed as an institution that could, and should, be run as if it were a private, for-profit enterprise. However, she emphasizes that she does not have clear alternatives of her own. More about that later.

The second chapter, “Hijacked! How the Standards Movement Turned Into the Testing Movement,” continues to set the context for NCLB. It is in this chapter that you learn three things about Diane Ravitch. First, she strongly dislikes NCLB and all its progeny: testing, so-called accountability, choice, etc. Second, she appreciated the 1983 report A Nation at Risk and the prescription it gave the nation’s education system. Third, and most of all, she likes strong standards and curriculum, believing that they lead to more well-rounded, deeper-thinking students. Take note, because that’s about the only thing she appears to like in the entirety of the book.

The book then becomes a whirlwind of detail in a House-That-Jack-Built layered style of argumentation. In other words: She really makes her case. The third, fourth and fifth chapters tell the stories of three different school districts and how fundamental changes to their organizations and policies in the mode of NCLB-era ideas led to uncertain outcomes. Those uncertain outcomes are a theme throughout the rest of the book. It seems that for every initiative there are a thousand studies, all of them reaching a different conclusion.

The next three chapters form the crux of her argument: “NCLB: Measure and Punish”, “Choice: The Story of an Idea”, and “The Trouble with Accountability.” In these chapters she outlines, detail by detail (by detail) the case against No Child Left Behind and its policies. In “What Would Mrs. Ratliff Do?”, she talks about the growing movement to link teacher evaluations to test scores and wonders if her own favorite teacher, Mrs. Ratliff, would be considered a great teacher today, she of the red marking pen and nineteenth-century poetry. Sadly, she probably wouldn’t.

Finally, the next to last chapter, “The Billionaire Boys’ Club” is aimed directly at those large foundations and endowments that, Ravitch argues, are driving education reform with their own agendas instead of seeking out innovators already in the field. She talks at length about the Gates Foundation and its small schools agenda and how the Broad Foundation is supporting the movement to turn school administration into a business. This seems to be the core of her argument, that the more the powers that be have treated education as a business, the more detrimental it has been to our nation’s students. She makes this argument thoroughly and leaves no question marks about any of the major factors impacting education today.

Given that, what was unexpected for me was that there are many questions left unanswered at the end of the book. I finally reached the chapter I’d been waiting for, “Lessons Learned,” and found it pretty unsatisfying. Throughout this shortest of chapters, she uses the refrain “Our schools will not improve if…” to share what she thinks should be the priorities of our education system. She brings up national standards and common curriculum and talks about what the goals of testing and teacher evaluation should be, but gives very few specifics.

I guess I’d liken it to hearing a firebrand speaker and getting passionately excited about the cause only to be given a tin sword with which to go start the revolution. We need much more than lofty generalities to fix what is broken about our system. In all, though, I found this book incredibly well-argued, thought-provoking and interesting and would recommend it to anyone who wants to know the other side of the story of education in the last decade.

Sarah Schumacher is a secondary literacy coach and social studies coordinator in the Edmonds, Washington School District. She’s a member of the New Millennium Initiative teacher team exploring teaching policy issues in her state.


Making the Grades: My Misadventures in the Standardized Testing Industry
By Todd Farley
(PoliPoint Press, 2009)


Reviewed by Kenneth Bernstein, NBCT

High School Government & Social Studies (MD)

Teacher Leaders Network

As the use of tests created external to schools and classrooms has exploded, one issue has always been the question of whether to rely merely upon selected response (a.k.a. multiple choice) items, or to also include constructed response items (paragraphs and essays).

Selected responses are cheap to administer; they can be scored solely by machine and the results obtained quickly. It is even possible, utilizing item response theory, to administer the test on a computer and use early responses to vary the items offered to the test taker, thereby determining the level of performance more quickly and accurately.
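
The adaptive idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the Rasch (one-parameter) model from item response theory, the function names and item difficulties are hypothetical, and the ability update is a crude halving step rather than the statistical estimation an operational test would use.

```python
import math


def p_correct(ability, difficulty):
    """Rasch (one-parameter IRT) probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def next_item(ability, remaining_difficulties):
    """Pick the unused item whose difficulty is closest to the current estimate.

    Under the Rasch model an item tells us the most when p_correct is
    near 0.5, which happens when difficulty is close to ability.
    """
    return min(remaining_difficulties, key=lambda d: abs(d - ability))


def run_adaptive_test(answers_correctly, difficulties, n_items=5):
    """Administer up to n_items adaptively and return the ability estimate.

    answers_correctly(difficulty) -> bool stands in for the test taker.
    The halving step is an illustrative assumption, not a real estimator.
    """
    ability, step = 0.0, 1.0
    remaining = list(difficulties)
    for _ in range(min(n_items, len(remaining))):
        item = next_item(ability, remaining)
        remaining.remove(item)
        # Move the estimate up after a correct answer, down after a miss.
        ability += step if answers_correctly(item) else -step
        step /= 2
    return ability


# Hypothetical test taker who gets every item up to difficulty 1.2 right.
item_bank = [-2, -1, -0.5, 0, 0.5, 1, 1.5, 2]
estimate = run_adaptive_test(lambda d: d <= 1.2, item_bank)
```

Each answer narrows the range of plausible ability levels, which is why such a test can stop after far fewer items than a fixed-form test.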

But many think we need more: after all, life does not ask us to choose one out of four or five pre-selected choices. Thus many colleges and universities, employers, and classroom teachers prefer that the tests include constructed items — “essays” if you will.

While it is possible to machine-score such items, that technology is still in its relative infancy, which is why companies that produce tests have need of human scorers. And it is because of this need that we get Todd Farley’s book Making the Grades.

Farley spent 15 years in a variety of positions involved with the scoring of such constructed responses. He worked for a number of America’s most important assessment companies, often doing the work on contract for various states, including Virginia, where I live. 

I am not a trained psychometrician, although during my now-abandoned doctoral studies I did seriously study issues of assessment. I am a school teacher today, in my 15th year of teaching. Each year but one, I have had to prepare students to sit for external tests (which may or may not have met the criteria to properly be labeled “standardized”) that included constructed responses. These tests have included the Maryland School Performance Assessment Program, the Maryland High School Assessments, and The College Board’s Advanced Placement examinations. During my one year of teaching in Virginia, known for its Standards of Learning (SOL) assessments, the middle school American History test was made up entirely of selected response items. I also bring to this review some experience that parallels Farley’s: in 2009, I served as a Reader for the Advanced Placement examination in U.S. Government and Politics and scored one of the four Free Response Questions on that year’s examination.

As I glance at my copy of the book, I have more than 40 sticky notes that I have affixed to pages containing passages I thought might possibly be worth quoting. Some I obviously will forgo. Farley offers explanations of terms like reliability and validity, and explains how in the case of reliability, the term was often misused by those supervising the scoring process. Simply put, scoring companies are often satisfied if those scoring agree 80% of the time, even if that to which they agree is erroneous. It is like a scale that consistently reports your weight as 20 pounds less than reality. The information you obtain is reliable, but it is NOT valid.
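
To make the distinction concrete, here is a minimal sketch in Python. The scores are invented for illustration (they do not come from Farley’s book), and the function names are hypothetical. It shows two scorers who agree with each other on four of five essays while both sit below an assumed “true” score, just like the scale that always reads light.

```python
def percent_agreement(scores_a, scores_b):
    """Share of essays on which two scorers gave exactly the same score."""
    matches = sum(1 for a, b in zip(scores_a, scores_b) if a == b)
    return matches / len(scores_a)


def mean_error(scores, true_scores):
    """Average signed gap between assigned scores and the assumed true scores."""
    return sum(s - t for s, t in zip(scores, true_scores)) / len(scores)


# Hypothetical data: five essays on a 6-point scale.
true_scores = [4, 5, 3, 6, 4]  # what the essays "deserve" (assumed)
scorer_a = [3, 4, 2, 5, 3]     # consistently one point low
scorer_b = [3, 4, 2, 5, 4]     # matches scorer A on four of five essays

agreement = percent_agreement(scorer_a, scorer_b)  # reliability-style check
bias_a = mean_error(scorer_a, true_scores)         # validity-style check
```

Here the agreement rate reaches the 80% mark, yet scorer A is a full point low on every essay. A high agreement figure by itself says nothing about whether the scores are right.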

What educational measurement should provide is the ability to draw valid inferences from the information analyzed. If nothing else, reading this book will raise questions in your mind about whether many of the tests being used to evaluate students, teachers and schools meet that standard.

Farley demonstrates that reliability is not necessarily something we can rely on. Allow me to quote an entire paragraph from pp. 55-56 to illustrate:
But you want to talk about a sliding scale? The scale we used to score writing flopped about like a puppy on a frozen pond, going every which way, keeling over and standing up and falling down. In scoring writing, for instance, an essay that had a good development of ideas could earn a 6, a 5, a 4, maybe even a 3. An essay that was troubled on the sentence level in terms of grammar, usage, and mechanics could earn a 1, a 2, a 3, perhaps even a 4, 5, or 6. (I don’t dispute the idea: Gertrude Stein said of F. Scott Fitzgerald that she’d never met anyone who was such a poor speller, yet he still managed to produce a decent text or two.) The point is that essays with identical levels of ability in certain areas could end up (due to other considerations on the rubric) with significantly different scores. In scoring writing, we were far from having hard and fast rules to live by. It all seemed a little untenable, rather mystifying, and the easiest thing to do was to hand your essay off to your neighbor or plead with your supervisor for help.
That passage references the idea of the rubric, the standard by which the grader is supposed to evaluate the essay. If a rubric is sufficiently clear to give guidance, it also may be too rigid for the occasional creatively written paper. A rigid application of the rubric might, as Farley illustrates on more than one occasion, result in a good piece of writing being undervalued and a poor piece of writing receiving a high score.

And the scoring companies often have little control over the rubric and how it is applied. They are scoring under contracts issued by states that may leave them little flexibility. Allow me to illustrate using an example of a scoring team examining an anchor paper.

Anchor papers are supplied to scorers to give examples of the work expected at each scoring level of a rubric. Farley provides the four-point rubric for an 8th grade writing assessment (descriptive mode). The rubric, provided by the state in question, expects the student to use a five-paragraph format. According to the rubric, for the scorer to assign all 4 points, the organization, focus and development, style and sentence fluency, and grammar-usage-mechanics should be considered “excellent.” For 3 points, they should be considered “good.”

Farley describes how table leaders were trained to lead the scoring of this 8th grade assessment. After reviewing the anchor paper for a score of 3 (which Farley reprints in the book), all of the table leaders were scratching their heads, describing the paper as “lame.” One seventh grade teacher in the group argued that she would not consider this essay good work by her students. Another pointed out that it consisted of only simple sentences, and Farley (also a table leader) noted it had no voice and no style. The response of their trainer Maria is telling:
Maria looked down at the essay. “I’m not saying I’d give this a 3 in my classroom, either, but that’s how we have to score it based on this ‘focused holistic’ rubric. Most importantly, in this state’s Department of Education, the essay has a five-paragraph format, with introductory, body, and concluding paragraphs, and an introductory sentence in all five of them.”
As shocking as that is, the reader might not be prepared for what comes next. It’s an anchor paper that contains simply brilliant writing (and would be so for a high school student) but earns a score of 2. Here is Farley’s account of the conversation that ensued among the scorers:
   Greg scoffed. “This kid needs a publisher, not a score from us.”
   Maria looked guilty. “I know,” she said. “I certainly wouldn’t give this a 2, either. The writing may be sentimental, but it’s first-draft work from an eighth grader. It’s a damn good response, I agree.”
    “So?” Harlan said.
    “Well, what’s the important person or thing in the essay? It’s her favorite spot, a fact we don’t know until the last sentence. That’s not five-paragraph format, is it? There’s no introductory paragraph, no introductory sentence --”
   “No,” Greg said, “it’s way more artful than that, building up the suspense nicely and using some beautiful descriptive language.”
   “Yup,” Maria agreed shrugging. “I know. But this is how they want us to score them.”
   “Really?” I asked. “Rather a tedious five-paragraph essay than a beautifully done three or four paragraphs?”
   “It seems that way,” Maria answered. She looked at us, resigned. We looked back at her defeated.
   “All we care about is the formatting?” Pete asked.
   “That’s not the only thing,” Maria answered, “but it is the first thing.”
   “Wow,” I said, “it almost seems a kid could get a 3 for turning in an outline.”
   Maria thought about it. “Not quite,” she said.
The book is a good read, such a good read that I hesitate to go into too much detail, so that I don’t spoil the enjoyment – and the shock – you will experience as you read it. But I’ll share a few other samples.

At the time of the passage cited above, the scorers were earning $10/hour or less. They were not required to be knowledgeable in the subject matter, something that was a persistent issue in the experiences Farley cites. Scorers were trained, and had to meet a certain standard of scoring accuracy in order to be allowed to score. But the need for scorers was so great that the standards of accuracy were often bent, and scores were changed and manipulated to maintain acceptable levels of accuracy.

Please note that term: acceptable. Farley cites examples where too great a level of accuracy could cause problems, and this was truly scary, because the examples involved the scoring of the National Assessment of Educational Progress (NAEP), which is supposed to be the “gold standard” by which all other educational assessment in this nation is measured. Farley was told he could not have a higher degree of accuracy than was recorded in previous scoring cycles lest the comparability of scores from cycle to cycle be lost. Ponder that for a while.

I do want to offer some cautions. Farley paints with a very broad brush. What he says is certainly widely applicable, but not universally so.

I teach in Maryland, which until May 2009 included two kinds of Constructed Responses (Brief and Extended) as part of the High School Assessments required in four subjects (Biology, Algebra, English, and Government) for graduation from high school. Each constructed response was scored by the same 4-point rubric, a copy of which students had during the exam. In the scoring process, each response was read by two scorers. Inter-rater reliability required only that the scorers gave adjacent scores, not identical scores. If the scores were not identical, the student received the higher score. I am not sure how accurate a measurement that was, but at least the students got the benefit of the doubt, unlike the scenario above that Farley described.
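
The Maryland rule as I have described it can be written out as a short Python sketch. The function name is hypothetical, and the handling of scores more than one point apart is an assumption, since I have not said what happened in that case.

```python
def resolve_scores(first, second):
    """Combine two readers' scores for one constructed response.

    Identical or adjacent scores count as agreement, and the student
    receives the higher of the two. Returns None when the scores are
    more than one point apart; treating that case as "needs another
    read" is an assumption, not something described above.
    """
    if abs(first - second) <= 1:
        return max(first, second)
    return None


same = resolve_scores(3, 3)      # identical scores
adjacent = resolve_scores(2, 3)  # adjacent scores: student gets the higher one
far_apart = resolve_scores(1, 3) # not adjacent: no score assigned here
```

Written this way, the leniency is plain: two readers who disagree by a point still count as agreeing, and the recorded score is always the more generous one.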

Cost control is another factor that influences the quality of the scoring process. Farley’s account focuses on scoring companies that paid relatively low wages to individuals who often lacked the necessary professional background to make accurate, independent judgments about the work they were scoring. As a result, a highly controlled system of scoring was imposed. But this method of assessing writing samples is not universal.

Here I speak from my experience as a reader of free response questions on the Advanced Placement exam for US Government and Politics. To score this exam, an individual must teach the subject in a post-secondary institution or have at least three years’ experience teaching the AP course in a high school. We were certainly qualified as to content. We were also paid substantially more than the $10 an hour Farley cites for the incident above, plus expenses for transportation, food and lodging.

We were thoroughly trained. We had our work closely examined at first, until we demonstrated our competence. We were spot-checked regularly by table leaders and by question leaders. The range of our scores was monitored by computer, and if we showed any scoring patterns that raised questions, our work would be reexamined. But once we got going, the scoring was limited to a single reader because we had over 100,000 exams (four questions each) to be scored in less than a week after training.

I know how seriously the Advanced Placement officials took this process because of my own experience. I read very quickly, and I was so much faster than others that, in the beginning, my work was checked very closely until the question leader determined that I was scoring accurately. When I had any doubt about a response, I would on my own initiative check with one of my fellows and/or with the table leader. That pattern was widespread among my fellow scorers.

I would argue that the AP people have demonstrated that reasonably accurate and consistent scoring of constructed responses by properly trained people is possible, if one is willing to accept the concomitant costs.

Still, despite the caveat I offer based on my AP experience, I think Farley’s book is a valuable read with much to tell us about the often poorly understood processes and implications of large-scale, high-stakes testing. He ends the book with these blunt words:
If I had to take any standardized test today that was important to my future and would be assessed by the scoring processes I have long been a part of, I promise you I would protest; I would fight; I would sue; I would go on a hunger strike or march on Washington. I might even punch someone in the nose, but I would never allow that massive and ridiculous business to have any say in my future without battling it to the bitter, better end.

Do what you want, America, but at least you have been warned.

Want to stir some lively conversation among any gathering of teachers? Bring up teacher evaluation and assessment. For decades, teachers, administrators and policymakers have sparred over the issue -- with little in the way of progress. Most teacher evaluation is still principal-driven, drive-by, and checklist oriented. That could change as the new Administration begins to target -- and fund -- teaching quality initiatives, in concert with the Gates Foundation and other philanthropies.

Will teachers have a voice in this debate? Two TLN members from California aren't waiting to be asked. In a recent joint interview with the New York group TeachersCount, Anthony Cody and David B. Cohen described some fundamental changes they'd like to see in teacher evaluation and assessment -- and warned of the consequences of a narrow approach to making judgments about teaching quality.

Here's a sample:

1. What are some of the problems with current teacher evaluation practices?

Anthony Cody: Time is a big factor. Recent surveys of principals have revealed they have inadequate time for observing and evaluating their teachers. My experience as a Peer Assistance and Review (PAR) coach in my district supports this because over the course of two years I saw dozens of evaluations that were incomplete. Many of these teachers should have been enrolled in PAR, and might have wound up being terminated, but their principals did not have the time to follow through.

This also reflects another weakness of our practice -- that evaluation is the sole responsibility of a few site administrators, and is primarily used as a means of eliminating “bad” teachers. Evaluation tends to occur in the form of a few isolated observations, with little connection to the professional growth of most teachers.

David Cohen: We also see that the tools and training for evaluation are rather uneven. Too many evaluators are going into classrooms armed with checklists that aren’t nearly up to the task of capturing the complexity of what they might observe. And it’s not just the materials, but the evaluators themselves who need development.

I’m fortunate to work in a district where secondary school teachers are mostly evaluated by a fellow teacher serving as the instructional supervisor. Unlike traditional department chairs, these teachers have had some additional training in conducting evaluations. It’s a long-standing and popular practice at this point, with the added benefit of providing teachers with evaluators who know the subject matter. If your principal used to teach English, and you're the AP physics instructor helping students with the calculus involved in their lab work, there seems to be an inherent limitation in that evaluative relationship.

2. What improvements would we see in your ideal evaluation system?

Anthony: We may be able to get beyond the time crunch for the principal if we re-imagine evaluation as something more positive, more collaborative and more integrated with professional culture at a school site.

David: This is a shift in mindset: let’s appeal to the best in professional educators. I’ve never met a teacher who didn’t want to be effective in the classroom. But we know that in order to maximize effectiveness, we need the opportunity to analyze and reflect on our work, and use that process to improve.

The current pace of teaching, and the student loads for secondary school teachers in particular, present huge obstacles to that kind of work: when you’re trying to monitor and manage the learning of 150 students or more, you’re in survival mode too often. If more schools would build in time for careful study of our own work, collaboration with colleagues and guidance by teacher leaders and administrators, we’d be far ahead of current practices. I’m certain we’d end up talking more about students’ learning and achievement, which goes a long way towards solving other issues in the classroom (like classroom management) without letting those issues consume you.

Other interview questions include:

3. Why do teachers resist the use of student performance in teacher evaluations?

4. What are the benefits of improved evaluation if tenured teachers are almost impossible to remove?

5. How does teacher evaluation fit in with current reform efforts?

6. What is the role of teacher evaluation in elevating teacher quality? Should we have performance pay to reward teachers with the best evaluations?

7. How has NCLB affected teacher evaluation?

Read the entire interview with Anthony and David here.
