The problems with standardized tests lie less with the content they cover than with their very form – which drives their content and everything else about them.
The tests have looked pretty much the way they do ever since the fifties – a bunch of kids all in the same place, bubbling in answers to the same questions, under the same strict time limits, under the watchful gaze of a roving proctor. Replicate across district, state, nation, world.
The tests took this form not because it is good at measuring what kids know, but because it is efficient. The form is the product of 20th-century industrialization, with speed, uniformity, quantifiability, and mass production being the governing virtues.
Large-scale testing of this type was made possible by two industrial-era inventions: 1) the multiple choice question, which allowed for the super-quick recording and evaluation of responses; and 2) Scantron technology, which turned the job of checking answers over to an electronic scanning machine that could read bubble sheets at blinding speed, achieving a quantum leap in scalability and cost efficiency.
A third development helped create the large-scale standardized test as well: the science of psychological measurement, itself born of the quintessentially 20th-century project of bringing the rigor and precision of the hard sciences to the messy business of human thought and behavior.
“If only we could control the variables well enough,” supposed the mid-century psychometricians, presumably adjusting their spectacles and checking their clipboards, “why, we could reliably measure even intelligence itself, quantifying each subject’s relative value within the collective!” Backs were slapped. Huzzahs exchanged.
It sounds kind of scary now, in the way that Taylorism and Skinner Boxes sound scary. And some of this science was indeed put to nefarious ends, such as justifying the exclusion of whole groups of Americans from college admission, as The College Board did in its early years.
But mostly the impulse to quantify was sincerely pointed toward improving and democratizing education. After all, if there were a basis for comparing students under standardized conditions, we might be able to glean some reliable insights into how our education system treats different subgroups, how geographic regions differ and why, whether our efforts to educate are improving over time, etc. Maybe we could figure out who needs more help, and how to teach better, and where best to put our resources.
Or we might even be able simply to look at a number to determine who is ready for college and who is perhaps, ahem, not quite Hahvahd material, so sorry.
Notice, however, how many suspect assumptions underlie the whole project. Are all of these students equally prepared for the experience of taking this test? It is, after all, pretty weird and stressful and artificial. Should all kids be expected to have the same knowledge and abilities? Is there really only one type of academic success? Are we confident that the tasks we’re putting in front of students are yielding the information we want? Given how contrived the test format and testing experience are, do we really even know what we’re measuring?
The big question, of course – the one that forever dogs standardized testing – is this: are the tests measuring what they need to or only what they can? And if only what they can, is that good enough to support the kind of test-results-based inferences we’re making about kids and teachers and schools?
The constraints that format standardization places on content, time, and space drive what gets tested. That is, the validity and reliability of the test require the standardization of conditions – which means corralling groups of kids into the same kinds of spaces, for the same amount of time, and asking them the same questions. The multiple choice format has historically made all this possible, but it is only good for eliciting certain kinds of knowledge and skills. The main skill it elicits, of course, is the dubious skill of test-taking itself. It can also do pretty well at testing a kid on basic skills, such as the rules of grammar, but it cannot draw out a student’s capacity for “higher order” academic abilities, such as generating original ideas for an essay.
In fact, even including some number of constructed response questions and technology-enhanced items, conventional standardized tests cannot elicit the kinds of things essential for authentic academic work in college:
- Analyzing conflicting source documents
- Supporting arguments with evidence
- Solving complex problems that have no obvious answer
- Drawing conclusions
- Offering explanations
- Conducting research
- Thinking deeply about what you’re being taught.
They can’t do it given test-taking time constraints, and they can’t do it because the cost of evaluating this kind of student work on a large scale would blow up the whole enterprise.
And the short list above, drawn from the National Research Council, leaves a lot out, including the many social, dispositional, and behavioral skills students need for success.
When so much in our education system is determined by scores on large-scale standardized tests – especially school funding and teacher evaluations – it is not surprising that many schools, however frustrated in their own efforts, resort to training kids to perform on the tests. Otherwise, they’re out of business. But that means the kids are not learning what they need most for actual academic success, only what they need for test-taking.
To extend the tragedy, our students aren’t even doing well on the tests they’re being trained to take. How do we know this? The tests!
In other words, we’re operating within a strange, Escher-like world, in which standardized tests serve as the instruments used to monitor how much they themselves are screwing up education.
When we consider all the essential knowledge, skills, and abilities that these tests, because of their very form, cannot elicit and measure, it’s clear that we really need to start reevaluating their usefulness.
One further thing to consider: even now, as more and more tests migrate to computers, and as computer adaptive testing, tech-enhanced items, and automated scoring become more common, large-scale standardized tests are still overwhelmingly reliant on the multiple choice question.
That is, even as a world of immensely powerful and networked digital technologies has grown up around them, the tests, in their basic form, are still rooted in the mid-20th-century paradigm of the bubble-in Scantron sheet.
This actually gives rise to hope, however. It raises the possibility that we might already have the tools to create different kinds of instruments for education measurement, but that we’re just not using them. These would be instruments that share the original goals of providing insights for improving and democratizing education, but which also overcome the limitations on content, time, and space that have always made old-school standardization such a poor governing principle for assessing students.
© 2016 BetterRhetor Resources LLC
LevelUp is a blog by William Bryant, examining Assessments, College Admissions, and the Readiness Gap. William is Founder and CEO of BetterRhetor, a company developing new ideas and technologies to address challenges in assessment and instruction. He can be reached at firstname.lastname@example.org.