Though they caution that it might not tell you anything about your actual biases
Over two decades after it was released to the public, a test that purports to measure the biased and prejudiced feelings of those who take it remains a fixture of popular psychology. Yet while its reliability has come under serious doubt in recent years, the creators and managers of the test still defend it as a useful tool.
The Implicit Association Test, released in 1998 by a group of Harvard researchers, may simply be little more than an entertaining quiz rather than a true measure of one’s hidden biases. Several journalists have suggested that the test is not at all a reliable indicator of internal prejudice, even though the tool has been popularized as just that.
A few years ago at Vox, German Lopez wrote on his experiences taking the test. Lopez took the test three separate times over the course of several days and received different results each time. The test indicated he had, alternately, no racial preferences, biases against whites, and biases against blacks.
“It turns out the IAT might not tell individuals much about their individual bias. According to a growing body of research and the researchers who created the test…the IAT is not good for predicting individual biases based on just one test. It requires a collection — an aggregate — of tests before it can really make any sort of conclusions,” Lopez wrote.
At The Cut around the same time, Jesse Singal announced similar findings, writing that the test “has serious issues on both the reliability and validity fronts.” At the time Singal wrote that there was “a serious dearth of published information on test-retest reliability” of the racial implicit bias exam, though Singal said that what little data have been published “suggest the race IAT’s test-retest reliability is far too low for it to be safe to use in real-world settings.”
“If you take the test today, and then take it again tomorrow — or even in just a few hours — there’s a solid chance you’ll get a very different result. That’s extremely problematic given that in the wild, whether on [the project’s website] or in diversity-training sessions, test-takers are administered the test once, given their results, and then told what those results say about them and their propensity to commit biased acts,” Singal added.
That reliance on single administrations has filtered into the popular consciousness. A glowing 2005 profile of the test in The Washington Post made no mention of the fact that a single test is insufficient for determining biases. That article presents the experiences of two test-takers; the article states that the test had determined they held “biases against homosexuals,” without stating whether or not they took the test multiple times. The article does imply that one of the test-takers “found she had a bias for whites over blacks” after taking the test only once.
Elsewhere, in The New York Times, columnist Nicholas Kristof has written positively about the test without mentioning any caveats about it. As recently as 2015, Kristof suggested readers take the test, writing: “It’s sobering to discover that whatever you believe intellectually, you’re biased about race, gender, age or disability.” Kristof did not mention the test’s inconclusiveness regarding single administrations.
Researchers defend test, research
Numerous academics interviewed by The College Fix defended the test, claiming that it remains a useful tool for predicting or demonstrating implicit biases.
Colin Smith, the project’s director of education, spoke to The Fix at length via email about the test.
Smith noted that some of the test results cited by Singal were around 15 years old. “Honestly, these conversations can get a little tiring for those of us who are engaged in the new accumulation of knowledge because we’re constantly dealing with older conceptions,” he said.
Smith cited numerous studies that have demonstrated varying levels of reliability, with one paper indicating “a test-retest reliability as low as .01 and as high as .36,” another with a reliability of .72, and another demonstrating .54 reliability.
Test reliability is measured on a scale of 0.0 to 1.0, meaning the various studies have found the test’s reliability to range from fairly high to dismally low. Smith admitted that this variability is something researchers are still trying to explain.
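For readers unfamiliar with the statistic, test-retest reliability is simply the correlation between the same people’s scores on two administrations of a test: 1.0 means scores line up perfectly across sessions, 0.0 means a first score tells you nothing about the second. The sketch below uses invented scores, not Project Implicit data, purely to show how such a coefficient is computed.

```python
# Illustration only: test-retest reliability is the Pearson correlation
# between the same people's scores on two sittings of the same test.
# The scores below are invented for demonstration, not real IAT data.

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical scores for five test-takers, first sitting vs. second sitting.
session1 = [0.42, 0.10, 0.65, -0.20, 0.33]
session2 = [0.30, 0.05, 0.50, 0.15, 0.28]

print(pearson_r(session1, session2))
```

The figures the researchers cite (.01, .36, .54, .72) are coefficients of exactly this kind, computed across large samples of test-takers.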
“What explains that variability is something that we don’t understand very well yet,” he said.
Surprisingly, Smith defended the test’s apparently erratic results.
“[T]here’s this perception that people get one score, go back again, and get some wildly different score. My guess is that is often what we’d call ‘anecdotal evidence.’ It makes for a great line in a talk or an article,” he said.
“Then, even if it WERE true, it might be evidence of the measure working accurately. In other words, our leading theories are increasingly geared toward a state rather than trait version of what the IAT measures.” Smith offered a comparison of “an accurate detector of your heart rate. It would be GOOD if the numbers on that detector went up as you ran up some stairs and went down after you’d been sitting for a few minutes.”
‘Misconceptions that ought to be corrected’
Tony Greenwald, a member of Project Implicit’s board of directors, declined to comment on the articles by German Lopez and Jesse Singal, calling them “polemical pieces” by “non-scientists.” He said that he “respond[s] to responsible criticism of the IAT when it appears in the peer-reviewed scientific literature.”
However, Greenwald did provide an account of the test and its peer-reviewed studies. “Single administrations of IAT measures should, in general, not be used (a) for individual-diagnostic purposes or (b) for making personnel selection decisions,” he said. Greenwald reminded The Fix that this statement is included on the implicit association test’s website.
“The justification is now better established than when this recommendation was first provided on the Project Implicit site,” Greenwald said, citing data “from two well-established research findings: (a) test–retest reliabilities of single administrations of IAT measures are no more than modest.”
Greenwald said that the test-retest statistic for a single administration of the IAT is .50, “which is well below psychometric standards for individual diagnostic use of individual difference measures.”
Greenwald said that when “multiple IAT measures on the same subjects” were averaged out, the reliability increased greatly.
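Greenwald’s point about averaging follows a standard psychometric result, the Spearman–Brown formula: averaging k comparable administrations with single-test reliability r yields a predicted reliability of kr / (1 + (k − 1)r). The sketch below takes only the .50 figure from Greenwald’s statement; everything else is illustrative.

```python
# Spearman-Brown prediction: how the reliability of an *average* of k
# comparable test administrations grows with k. Only the 0.50 starting
# value comes from Greenwald's statement; the rest is illustration.

def spearman_brown(r, k):
    """Predicted reliability of the average of k parallel administrations,
    given single-administration reliability r."""
    return k * r / (1 + (k - 1) * r)

single = 0.50  # Greenwald's cited test-retest figure for one IAT sitting
for k in (1, 2, 4, 8):
    # reliability climbs from 0.50 toward 1.0 as administrations are averaged
    print(k, spearman_brown(single, k))
```

This is why the researchers describe the test as informative “in the aggregate” while cautioning against diagnosing any individual from a single sitting.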
Calvin Lai, the director of research for Project Implicit, pushed back against the assertion that the tool does not possess any test-retest reliability. Lai, who stressed that he was “speaking as an individual scientist rather than as a formal representative of Project Implicit’s position,” said that the test “predicts real-world behavior in the aggregate when data is combined across many people.”
“However, a single administration of the IAT is not effective in the real-world *diagnosis* of individuals. The IAT was originally designed as a research tool for answering scientific questions about how people think, feel, and behave in the aggregate,” Lai said.
Lai offered an analogy that somewhat echoed Smith’s earlier one about heart rates.
“At the doctor’s office, one high blood pressure measurement may not mean much. You may have run up the stairs on the way up, felt unusually stressed that day, or any number of other factors. However, that blood pressure assessment does tell us a bit about a person’s blood pressure in general. Not enough to be diagnostic, but not enough to be completely ignored. A single IAT measurement performs in a similar way,” he said.
“I routinely see understatements which suggest the IAT tells us almost nothing at all about hidden biases, and overstatements which suggest that the IAT is diagnostic of individuals. Both types of statements are misconceptions that ought to be corrected,” Lai added.
Test’s popularity continues
Though its creators and managers are frank about the test’s limitations regarding individual test-takers, the tool remains highly popular. Both Lai and Smith gave high estimates for the number of implicit association tests that are taken every day. “From our recent estimates, over a million IATs have been started per year,” Lai said. Smith said that “it’s easy to estimate that something like 10 IATs are completed in any given minute of the day.”
The sheer number of times the test has been taken raises questions about how the public is perceiving it, and whether or not the test’s shortcomings have been adequately communicated to the public.
Lai told The Fix that the “debriefing” page of the implicit association test clarifies what it may and may not measure. “These results are not a definitive assessment of your implicit preference. The results may be influenced by variables related to the test (e.g., the category labels or particular items used to represent the categories on the IAT) or the person (e.g., how tired you are),” one disclaimer reads.
Smith expressed frustration with the way the public learns about the various facets of the test.
“At the end of the day, I can’t stop people from thinking what they’re going to think. You can get the papers you linked to by Googling ‘IAT bad’ so people are going to find arguments that suit their way of thinking. I don’t think that’s terrible, though I do worry about the differences between me thinking about this all the time for 16 years and someone doing a directed Google search,” he said.
Smith added that the researchers strive to ensure transparency and forthrightness regarding the test.
“I know that, at Project Implicit, we work to have everything we say on the website, in our talks, in our published research, etc. be as scientifically accurate as possible. We don’t say words like ‘racist’ ever, for example. I don’t think anyone in our group even uses ‘prejudice’ to refer to implicit measures. So, how do we deal with popular press articles that refer to ‘the Harvard racism test’? We each have only so much time and energy in any one given day,” he said.
IMAGE: Rob Wilson / Shutterstock.com