
Many Psychology Findings Not as Strong as Claimed, Study Says

By BENEDICT CAREY
AUG. 27, 2015

Staff of the Reproducibility Project at the Center for Open Science in Charlottesville, Va., from left: Mallory Kidwell, Courtney Soderberg, Johanna Cohoon and Brian Nosek. Dr. Nosek and his team led an attempt to replicate the findings of 100 social science studies. Credit: Andrew Shurtleff for The New York Times

The past several years have been bruising ones for the credibility of the social sciences. A star social psychologist was caught fabricating data, leading to more than 50 retracted papers. A top journal published a study supporting the existence of ESP that was widely criticized. The journal Science pulled a political science paper on the effect of gay canvassers on voters’ behavior because of concerns about faked data.

Now, a painstaking yearslong effort to reproduce 100 studies published in three leading psychology journals has found that more than half of the findings did not hold up when retested. The analysis was done by research psychologists, many of whom volunteered their time to double-check what they considered important work. Their conclusions, reported Thursday in the journal Science, have confirmed the worst fears of scientists who have long worried that the field needed a strong correction.

The vetted studies were considered part of the core knowledge by which scientists understand the dynamics of personality, relationships, learning and memory. Therapists and educators rely on such findings to help guide decisions, and the fact that so many of the studies were called into question could sow doubt in the scientific underpinnings of their work.

“I think we knew or suspected that the literature had problems, but to see it so clearly, on such a large scale — it’s unprecedented,” said Jelte Wicherts, an associate professor in the department of methodology and statistics at Tilburg University in the Netherlands.

More than 60 of the studies did not hold up. Among them was one on free will. It found that participants who read a passage arguing that their behavior is predetermined were more likely than those who had not read the passage to cheat on a subsequent test.

Another was on the effect of physical distance on emotional closeness. Volunteers asked to plot two points that were far apart on graph paper later reported weaker emotional attachment to family members, compared with subjects who had graphed points close together.

A third was on mate preference. Attached women were more likely to rate the attractiveness of single men highly when the women were highly fertile, compared with when they were less so. In the reproduced studies, researchers found weaker effects for all three experiments.

The project began in 2011, when a University of Virginia psychologist decided to find out whether suspect science was a widespread problem. He and his team recruited more than 250 researchers, identified the 100 studies published in 2008, and rigorously redid the experiments in close collaboration with the original authors.

The new analysis, called the Reproducibility Project, found no evidence of fraud or that any original study was definitively false. Rather, it concluded that the evidence for most published findings was not nearly as strong as originally claimed.

Dr. John Ioannidis, a director of Stanford University’s Meta-Research Innovation Center, who once estimated that about half of published results across medicine were inflated or wrong, noted the proportion in psychology was even larger than he had thought. He said the problem could be even worse in other fields, including cell biology, economics, neuroscience, clinical medicine, and animal research.

The report appears at a time when the number of retractions of published papers is rising sharply in a wide variety of disciplines. Scientists have pointed to a hypercompetitive culture across science that favors novel, sexy results and provides little incentive for researchers to replicate the findings of others, or for journals to publish studies that fail to find a splashy result.

“We see this is a call to action, both to the research community to do more replication, and to funders and journals to address the dysfunctional incentives,” said Brian Nosek, a psychology professor at the University of Virginia and executive director of the Center for Open Science, the nonprofit data-sharing service that coordinated the project published Thursday, in part with $250,000 from the Laura and John Arnold Foundation. The center has begun an effort to evaluate widely cited results in cancer biology, and experts said that the project could be adapted to check findings in many sciences.

In a conference call with reporters, Marcia McNutt, the editor in chief of Science, said, “I caution that this study should not be regarded as the last word on reproducibility but rather a beginning.” In May, after two graduate students raised questions about the data in a widely reported study on how political canvassing affects opinions of same-sex marriage, Science retracted the paper.

The new analysis focused on studies published in three of psychology’s top journals: Psychological Science, the Journal of Personality and Social Psychology, and the Journal of Experimental Psychology: Learning, Memory, and Cognition.

The act of double-checking another scientist’s work has been divisive. Many senior researchers resent the idea that an outsider, typically a younger scientist, with less expertise, would critique work that often has taken years of study to pull off.

“There’s no doubt replication is important, but it’s often just an attack, a vigilante exercise,” said Norbert Schwarz, a professor of psychology at the University of Southern California.

Dr. Schwarz, who was not involved in any of the 100 studies that were re-examined, said that the replication studies themselves were virtually never evaluated for errors in design or analysis.

Dr. Nosek’s team addressed this complaint in part by requiring the researchers attempting to replicate the findings to collaborate closely with the original authors, asking for guidance on design, methodology and materials. Most of the replications also included more subjects than the original studies, giving them more statistical power.

Strictly on the basis of significance — a statistical measure of how likely it is that a result did not occur by chance — 35 of the studies held up, and 62 did not. (Three were excluded because their significance was not clear.) The overall “effect size,” a measure of the strength of a finding, dropped by about half across all of the studies. Yet very few of the redone studies contradicted the original ones; their results were simply weaker.
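For readers who want to check the arithmetic behind those counts, the short Python sketch below works out the shares implied by the figures reported above. It is only an illustration: the variable names are invented here, and expressing the result as a share of the 97 studies with a clear significance result is an assumption about how to read the numbers, not a calculation taken from the report itself.

```python
# Counts as reported in the article (significance criterion only).
held_up = 35          # replications that reached statistical significance
did_not_hold_up = 62  # replications that did not
excluded = 3          # studies whose significance was not clear

# All studies in the project, and those with a clear yes/no outcome.
total = held_up + did_not_hold_up + excluded
assessed = held_up + did_not_hold_up

# Assumption: shares are taken over the assessed studies, not all 100.
share_held = held_up / assessed
share_failed = did_not_hold_up / assessed

print(f"Studies in total: {total}")
print(f"Held up: {share_held:.0%} of {assessed} assessed studies")
print(f"Did not hold up: {share_failed:.0%} of {assessed} assessed studies")
```

On this reading, roughly a third of the assessed studies held up and nearly two-thirds did not, consistent with the statement that more than half of the findings failed on retesting.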

“We think of these findings as two data points, not in terms of true or false,” Dr. Nosek said.

The research team also measured whether the prestige of the original research group, rated by measures of expertise and academic affiliation, had any effect on the likelihood that its work stood up. It did not. The only factor that did was the strength of the original effect — that is, the most robust findings tended to remain easily detectable, if not necessarily as strong.

The project’s authors write that despite the painstaking effort to duplicate the original research, there could be differences in the design or context of the reproduced work that account for the different findings. Many of the original authors certainly agree.

In an email, Paola Bressan, a psychologist at the University of Padua and an author of the original mate preference study, identified several such differences — including that her sample of women were mostly Italians, not American psychology students — that she said she had forwarded to the Reproducibility Project. “I show that, with some theory-required adjustments, my original findings were in fact replicated,” she said.

These are the sorts of differences that themselves could be the focus of a separate study, Dr. Nosek said.

Extending the project to other fields will require many adaptations, not least because of the cost of running experiments in medicine and brain science. To check cancer biology results, for instance, the Center for Open Science will have to spend far more money than was spent on psychology.

Stefano Bertuzzi, the executive director of the American Society for Cell Biology, said that the effort was long overdue, given that biology has some of the same publication biases as psychology. “I call it cartoon biology, where there’s this pressure to publish cleaner, simpler results that don’t tell the entire story, in all its complexity,” Dr. Bertuzzi said.

Correction: August 29, 2015

A picture caption on Friday with an article about the findings of the Reproducibility Project, an assessment study of 100 published psychology papers, misspelled the surname of one of the researchers involved. She is Courtney Soderberg, not Soderbergh.