I’m shamelessly abusing the aftermath of the Stapel affair as an excuse to paint the beginning of this post in novelette colors. Admittedly, the case doesn’t exactly recommend itself for jokes, but the temptation was too great after noting that the combined final reports of Tilburg University (Levelt Committee), University of Amsterdam (Drenth Committee) and University of Groningen (Noort Committee), released a bit more than a week ago here, went by the title:
“Flawed science: The fraudulent research practices of social psychologist Diederik Stapel”
Who says science is boring and dry – this is big drama: a fraud of epic dimensions, the rise and fall of a parvenu. Only the love story seems to be missing, but given the topic it would seem only appropriate to make one up to help the story along.
The case of Diederik Stapel has been discussed in exhaustive detail on blogs and science news sites – for those who need to catch up: the report is actually quite readable, as is the Nature news feature that I linked to at the beginning of this post, as well as further blog posts 1,2,3,4 that I found worth reading. Thus, I hope it will suffice as a summary to say that it recently became known, and is now established beyond reasonable doubt, that much, maybe most, of the research conducted by Diederik Stapel, by all accounts an intelligent and talented young professor of social psychology last working at Tilburg University, was based on fabricated or manipulated data and inappropriate statistical analyses. This includes the data used for a number of PhD theses that were produced under his supervision.
Wider implications for science
Here, however, I want to concentrate on the wider implications of the Stapel affair, i.e. the culture under which this fraud emerged, and the reasons why it was uncovered so late, despite signs that now, in retrospect, seem glaringly obvious. The report of the Dutch universities, which is to be commended for spending considerable time on this topic, gives answers in two sections: “Research culture: working method and research environment”, which deals with the culture in the Stapel lab, and “Research culture: flawed science”, which deals with the general culture in social psychology.
Although the examination of the lab culture produced some peculiarities, I would say that most things seemed pretty much in the normal range to me. I have definitely heard of more bizarre things, with the obvious exception that Mr. Stapel routinely, and over many years, fabricated and manipulated data for students and coauthors.
The bigger blame, however, is given to the culture in social psychology, which has had a bit of a hard time recently anyway with other cases such as those of Marc Hauser or Dirk Smeesters. Science concluded in the title of their news story on the final report: “Stapel Affair Points to Bigger Problems in Social Psychology”. The New York Times headlined: “Fraud Case Seen as a Red Flag for Psychology Research”. And indeed, the Dutch report states that:
It is almost inconceivable that co-authors who analysed the data intensively, or reviewers of the international “leading journals”, who are deemed to be experts in their field, could have failed to see that a reported experiment would have been almost infeasible in practice, did not notice the reporting of impossible statistical results, … and did not spot values identical to many decimal places in entire series of means in the published tables. Virtually nothing of all the impossibilities, peculiarities and sloppiness mentioned in this report was observed by all these local, national and international members of the field, and no suspicion of fraud whatsoever arose.
There was a fair amount of discussion on the internet about whether singling out the social psychology community in particular is fair, given that there were similar cases of fraud in other disciplines as well. To cut a long story short, I think it is. The report gives plenty of evidence, some of it already well known, that many of the softer forms of scientific misconduct used by Stapel were considered tolerable, if not normal, in the community, and that the actual act of fabrication seems to have been made easier by a lack of any habit of doubt and reproduction. That is not to say that other research fields, among them ecology, might not be struggling with similar problems. That, however, is all the more reason to look into these habits, instead of burying our heads in the sand and pretending that these rare cases of fraud are decoupled from the community as a whole.
Verification bias and the curse of significance
The first problem that seems to have contributed to the fraud is what I call “the curse of significance” – the obsession with simple stories, null hypothesis testing and (significantly) positive results. The Dutch report, for example, states that
Reviewers have also requested that not all executed analyses be reported, for example by simply leaving unmentioned any conditions for which no effects had been found, although effects were originally expected. Sometimes reviewers insisted on retrospective pilot studies, which were then reported as having been performed in advance. In this way the experiments and choices of items are justified with the benefit of hindsight.
Not infrequently reviews were strongly in favour of telling an interesting, elegant, concise and compelling story, possibly at the expense of the necessary scientific diligence.
I have blogged about the ramifications of this attitude before: multiple testing, reporting only selected data and the publication barrier for negative results inevitably lead to a distortion of the reported effect sizes in the scientific literature. This has been highlighted and examined in a number of recent studies, such as John P. A. Ioannidis's article “Why Most Published Research Findings Are False” or the article by Simmons et al., “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant”.
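To make the multiple-testing part of this argument concrete, here is a small simulation sketch. It is my own toy illustration, not taken from the report or from the cited papers, and it rests on loud assumptions: every study has a true effect of exactly zero, each study measures several independent outcomes, the variance is treated as known (so a simple z-test applies), and a study counts as “publishable” if at least one outcome reaches p < 0.05. The function names and default values are hypothetical.

```python
import math
import random

def two_sample_p(n, rng):
    """Two-sided p-value for a difference in means when the true effect is zero.

    Both groups are drawn from N(0, 1). The variance is treated as known
    (an assumption of this sketch), so a plain z-test is appropriate.
    """
    mean_a = sum(rng.gauss(0, 1) for _ in range(n)) / n
    mean_b = sum(rng.gauss(0, 1) for _ in range(n)) / n
    z = (mean_a - mean_b) / math.sqrt(2 / n)
    return math.erfc(abs(z) / math.sqrt(2))

def false_positive_rate(outcomes, studies=2000, n=20, alpha=0.05, seed=1):
    """Share of null studies that find at least one 'significant' outcome."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(studies):
        p_values = [two_sample_p(n, rng) for _ in range(outcomes)]
        if min(p_values) < alpha:
            hits += 1
    return hits / studies

# Compare the simulated rate with the textbook value 1 - (1 - alpha)^k
# for k independent tests.
for k in (1, 5, 10):
    print(k, round(false_positive_rate(k), 3), round(1 - 0.95 ** k, 3))
```

With a single outcome the simulated rate should sit near the nominal 5%, and it climbs quickly once a researcher is free to pick the best of several outcomes and stay silent about the rest. Real outcomes are usually correlated, so the exact numbers will differ, but the direction of the distortion is the same.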
The medicine against this illness is pretty obvious: ideally, journals would decide whether to publish a manuscript solely by looking at the question and the quality of the experiment, without taking into consideration whether an effect is significantly different from zero (if you want, a blinding of editors). In general, I think we would do much better to focus on effect sizes and confidence intervals, not on testing the significance of a more or less arbitrary null hypothesis. This would both ensure a fair account of experimental results in the literature, which is important for meta-analysis, and remove a lot of the incentives for researchers to “tweak” their results that arise from the current clear preference for particular outcomes. Given that this has been suggested by so many before me, however, I don’t know whether we can hope for a major change in this area any time soon.
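As a rough sketch of what reporting “effect size and confidence interval” instead of a bare significance verdict could look like, consider the following. The function name and the returned keys are my own invention, and the interval uses a normal approximation, which is an assumption that is only reasonable for larger samples; for small samples one would use a t quantile instead.

```python
import math
from statistics import NormalDist, mean, variance

def mean_difference_ci(treatment, control, level=0.95):
    """Difference in means, an approximate confidence interval, and Cohen's d.

    Assumption of this sketch: the interval uses a normal quantile,
    not a t quantile, so it is too narrow for very small samples.
    """
    n_t, n_c = len(treatment), len(control)
    var_t, var_c = variance(treatment), variance(control)
    diff = mean(treatment) - mean(control)

    # Standard error of the difference without assuming equal variances.
    se = math.sqrt(var_t / n_t + var_c / n_c)
    z = NormalDist().inv_cdf(0.5 + level / 2)

    # Cohen's d: difference standardised by the pooled standard deviation.
    pooled_sd = math.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))

    return {
        "difference": diff,
        "ci": (diff - z * se, diff + z * se),
        "cohens_d": diff / pooled_sd,
    }

result = mean_difference_ci([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
print(result)
```

The point of reporting it this way is that a wide interval around a large estimate and a narrow interval around a small one are both informative, whether or not the interval happens to exclude zero.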
Open data, reproducibility and experimental reproduction
A point that might be easier to change is the lack of mechanisms that ensure reproducibility, in psychology as in other fields of science.
An easy modification that I think would make researchers much more cautious about their statistics would be a general obligation to submit the raw data, together with the scripts that record the statistical analysis, with all published papers. Cases where “data was lost” would thereby become impossible. Knowing that it is easy for others to rerun the analysis and/or examine the raw data for evidence of manipulation, as Uri Simonsohn did in the Smeesters case, would, I think, be a strong barrier against directly manipulating data or the statistical analysis.
Moreover, reproduction of research results should be valued more highly by the research community. Currently, a repetition of a study is essentially unpublishable because of a “lack of novelty”. I would say that if a question is really interesting, a repetition should be nearly as interesting as the original paper, in particular if it doesn’t produce the same results. Journal editors could make a big difference here – if reproducing research were rewarded with publication in high-ranked journals, it would be done, and if it were done, fabrication, manipulation, and honest mistakes would be detected much earlier, with the result that we would be on much firmer ground regarding what we know.
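To illustrate how cheap such a check becomes once raw data is deposited, here is a minimal sketch that recomputes condition means from raw values and flags any reported mean that cannot be obtained by rounding. The data layout (a dictionary mapping condition names to lists of values) and the function name are hypothetical; this is not the method used by any of the committees or by Simonsohn.

```python
from statistics import mean

def check_reported_means(raw, reported, decimals=2):
    """Return conditions whose reported mean is inconsistent with the raw data.

    raw:      {condition: [values, ...]}   (hypothetical layout)
    reported: {condition: mean as printed in the paper}
    A reported mean is accepted if it matches the recomputed mean
    up to rounding at the given number of decimals.
    """
    tolerance = 0.5 * 10 ** -decimals + 1e-12
    mismatches = {}
    for condition, values in raw.items():
        recomputed = mean(values)
        printed = reported.get(condition)
        if printed is None or abs(recomputed - printed) > tolerance:
            mismatches[condition] = (printed, round(recomputed, decimals))
    return mismatches

raw = {"control": [4.0, 5.0, 4.0, 5.0], "treatment": [6.0, 7.0, 5.0, 6.0]}
reported = {"control": 4.5, "treatment": 6.4}
print(check_reported_means(raw, reported))
```

A check like this obviously cannot detect raw data that was fabricated in the first place, but it removes any room for quietly adjusting the numbers between the data file and the published table.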
Asymmetric power and the risk of whistleblowing
Finally, a big factor in the game was, as so often, the understandable reluctance of researchers to make enemies among their colleagues and superiors for fear of personal disadvantages. I think every PhD student understands that accusing their supervisor, even with good reason, might seriously damage their further career. The Levelt Committee noted:
On the one hand Mr Stapel had an intensive one-to-one working relationship with the young researchers, and many PhD students viewed him as a personal friend. They visited his home, had meals together, went to the cinema, and so on. On the other hand, however, were the threats when critical questions were asked. It would then be made clear to the PhD student concerned that such questions were seen as a lack of trust and that none should be asked. It was precisely the close relationship with Mr Stapel that made it difficult for a junior researcher to see anything in this other than well-intentioned constructive criticism from the senior partner.
I think it is important to keep in mind that, at this point, probably none of the PhD students and postdocs involved could anticipate the whole extent of the fraud, or that an inquiry would end with such a crystal-clear verdict against Mr. Stapel. Most likely, they were considering the possibility that Mr. Stapel would be able to defend himself in a way that allowed him to stay in office, even if parts of their accusations were found to be true, with the obvious consequences for their own careers. The same goes for the academic staff who were made aware of the fraud – nothing was to be gained by following up on the allegation of fraud, but much could be lost. The Levelt Committee notes:
In 2010 and 2011 three mentions of fraud were addressed to members of the academic staff in psychology. The first two were not followed up in the first or second instances. Mr Stapel’s virtually unassailable position may have played a part. The third report, to the head of department, and extraordinarily carefully prepared by three young and, certainly in their position, vulnerable whistleblowers, was immediately picked up in a professional way, with the now familiar result.
Suspicions about data provided by Mr Stapel had also arisen among fellow full professors on two occasions in the past year. These suspicions were not followed up. The Committee concludes that the three young whistleblowers showed more courage, vigilance and inquisitiveness than incumbent full professors.
Fraud, fabrication and scientific misconduct, albeit not pervasive, are common – Daniele Fanelli reports in PLoS ONE that
… on average, about 2% of scientists admitted to have fabricated, falsified or modified data or results at least once –a serious form of misconduct by any standard […] and up to one third admitted a variety of other questionable research practices including “dropping data points based on a gut feeling”, and “changing the design, methodology or results of a study in response to pressures from a funding source”. In surveys asking about the behaviour of colleagues, fabrication, falsification and modification had been observed, on average, by over 14% of respondents, and other questionable practices by up to 72%.
I think many people don’t even realize that certain practices are wrong. A bit of data massaging is viewed as part of how the game is played, in the same sense as a wisely used foul is, while not strictly legal, considered part of the game in professional football. I don’t think that it is of much use to appeal to higher ideals to tackle this problem – of course, science is and must be based on trust, but the research community is growing larger, and so is the pressure on many individuals, and anyway, misconduct has been there since the early days of science. Instead of idealizing science, we should try to design an institutional structure that is able to maintain and confirm this trust. While there is certainly no panacea, small things could make a difference, such as reducing the obsession with significance, opening up data and algorithms, encouraging reproduction, and investing more in the independence of researchers.