Scientists are fighting about whether a major psychology study is totally wrong

Are psychology studies really that hard to repeat?

Alena Hovorkova/Shutterstock

Scientists are duking it out over whether a study on reproducibility has a reproducible conclusion. The high-profile study reported last year that many psychology findings can’t be replicated, but a separate group of researchers now claims the finding is severely flawed. Today in the journal Science, both sides of the debate are squaring off. One side says psychology has a problem. The other side says: actually, you just suck at analysis. The dispute at once shows why we need more reproducibility in science and calls into question whether there really is a reproducibility crisis at all.

Confirming results is a critical part of the scientific process; it's how we tell what's actually true. To see how often the results of published studies can be reproduced, a coalition of researchers worked together to repeat 100 studies recently published in major psychology journals. Their finding — the one at issue — was that only 39 of the 100 studies could be replicated, potentially identifying a major problem: either bad science is being published, or repeating a study is far more complicated than we might expect.

In a critique of the finding, researchers say the initial study erred in several ways. First, it didn’t account for expected failures; second, it relied on poorly reproduced studies with small samples; and third, the selection process for repeated studies wasn’t blind, which biased the results. "Metascience is not exempt from the rules of science," they write in conclusion. They say the Open Science Collaboration (OSC), which oversaw the psychology study, "seriously underestimated the reproducibility of psychological science."

"I trust we all agree that reproducibility is difficult to estimate," Daniel Gilbert, a professor of psychology at Harvard and the critique's lead author, tells The Verge. "The OSC believes they overcame this difficulty well enough to estimate the replicability of psychology, and we do not agree." It’s even possible, the critique says, that the study shows reproducibility rates in psychology are just fine. If the OSC had corrected for certain errors, it might have seen that.

The OSC stands by its findings. They say that Gilbert and his colleagues' critique is "limited by statistical misconceptions" and selective examination of their data. "This is perfectly fine to do in the service of generating a hypothesis," Brian Nosek, corresponding author of the OSC's reply, tells The Verge. "However, it is not a good way to draw reliable conclusions." Nosek says that other strong conclusions could be drawn from their data using this approach, but they would be "as unjustified" as the one arrived at by Gilbert and colleagues.

So which conclusion is right? For now, there's no definitive answer. "Even the top of the top scientists can disagree in interpretation of what are very solid results," says John Ioannidis, a Stanford professor known for his 2005 paper, "Why Most Published Research Findings Are False." Ioannidis, who was not involved with the psychology study or its critique, says Gilbert's critique doesn't change his reading of the OSC's study: that, if anything, they may have overestimated the number of reproducible studies. "But constructive debate is always useful," he says.

The OSC always emphasized that the problem could lie in reproduction

In some sense, the critique underscores one of the points made by the original study. Only reading as far as the headline explanation of the OSC's findings might lead you to believe that most psychology is bunk; but in reality, the study never lands on a definitive reason for why so many findings can’t be reproduced. Gilbert and colleagues argue it's because the OSC's methodology was flawed and allowed for poor replications. The OSC's response: Yeah, that could totally be the case.

After all, one of the OSC’s original findings was that, since there’s no standard in place for what a study replication should look like, there's no way of agreeing upon what it means when a replication is unsuccessful. Was the original science wrong? Or did the replication differ too much? "If different results are observed from original, it could be that original is wrong, replication is wrong, or both are right in that a difference between them explains why they found different results," Nosek says. In his mind, the OSC's study doesn't have enough information to say which of the three it is.

Scientists are investigating reproducibility rates in other disciplines. In a separate study that's also being published in Science today, researchers report trying to replicate 18 recent studies from major economics journals, borrowing methodology from the OSC's psychology study. In 11 cases — 61 percent of the time — they were able to reproduce the original results; in three other cases, the results were close to being replicated, too. "The [replicability] rate we report for experimental economics is the highest we are aware of for any field," Juergen Huber, a finance professor at the University of Innsbruck and co-author of the paper, said in a statement. However, their results came from a much smaller sample of reproductions.

Gilbert would argue that, regardless of the field, taking a better approach to replication in the first place should lead to clearer results. "Yes, replicating can be done well, and yes, doing it well is hard," he writes in an all-caps email to The Verge. "But just because it is hard to do something well does not mean that you should do it badly. This applies both to replication and playing the violin in public."

To the OSC, their work remains part of the puzzle. As they said from the start, there's no firm conclusion to be drawn from their results. "Evidence accumulation is slow. Confidence in explanation is slower," Nosek says. "Our confident conclusion is that there isn't enough evidence to explain the differences yet."