Over at Dynamic Ecology, Jeremy Fox argues that
Technical statistical mistakes are overrated; ecologists (especially students) worry too much about them. Individually and collectively, technical statistical mistakes hardly ever appreciably slow the progress of entire subfields or sub-subfields. And fixing them rarely meaningfully accelerates progress.
Don’t agree? Try this exercise: name the most important purely technical statistical mistake in ecological history. And make the case that it seriously held back scientific progress.
I would argue that nothing could be further from the truth. It’s actually no challenge at all to point out massive statistical problems that slow down progress in ecology. For that reason, but also simply because using inappropriate methods “is the wrong thing to do” for a scientist, I very much hope that students worry about this topic. Let me give a few examples.
p-hacking and researcher degrees of freedom
Statistical errors do not have to be massive and obvious to have an impact on the wider field.
IF A LOT OF SMALL PEOPLE IN A LOT OF SMALL PLACES DO A LOT OF SMALL THINGS, THEY CAN CHANGE THE FACE OF THE WORLD (possibly an African proverb, but surely a graffiti on the Berlin wall)
In recent years, there has been a widespread debate throughout the sciences about the reliability / replicability of scientific results (I blogged about this a few years back here and here, but there have been many new developments since – a recent collection of papers in PNAS provides a great, although somewhat broader overview).
The statistical issue I’m referring to is the impact of analysis decisions like
- Changing the hypotheses (predictor or response variables) during the analysis, e.g. trying out various combinations of predictors and response variables to see if the results are “improved” or what is “interesting”. This includes looking at the data before the analysis and deciding, based on that, which tests to run!
- Making data collection dependent on results, e.g. collecting a bit more data if there seems to be an effect, or removing data if it seems “weird”, or here
- Trying out different statistical tests and using those that produce “better” = more significant results
- etc. etc.
I think few people who are involved in teaching ecological statistics will dispute that these strategies, known as p-hacking, data-dredging, fishing and HARKing (hypothesizing after the results are known), are widespread in ecology, and a large body of research shows that they tend to have a substantial impact on the rate at which false positives are produced (see, e.g., Simmons et al., or the mind-boggling Brian Wansink story).
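To get a feeling for the size of this effect, here is a minimal simulation sketch in base R. It is not taken from the studies cited above; the seed, the sample sizes, the number of "looks" at the data and the number of alternative response variables are all illustrative assumptions. In every scenario the null hypothesis is true, so each "significant" result is a false positive.

```r
# Type I error rates under a true null hypothesis (no group difference).
# All settings (seed, sample sizes, number of looks) are illustrative assumptions.
set.seed(42)
nSim <- 2000

# Honest analysis: one pre-specified t-test with 20 observations per group
honest <- replicate(nSim, t.test(rnorm(20), rnorm(20))$p.value < 0.05)

# Optional stopping: test at 20 per group, then add 10 per group and re-test
# (up to 50 per group); count the run as a "finding" if any look gives p < 0.05
optionalStopping <- replicate(nSim, {
  x <- rnorm(50)
  y <- rnorm(50)
  looks <- c(20, 30, 40, 50)
  any(sapply(looks, function(n) t.test(x[1:n], y[1:n])$p.value < 0.05))
})

# Flexible response: test 5 independent response variables, report any "hit"
flexibleResponse <- replicate(nSim, {
  any(replicate(5, t.test(rnorm(20), rnorm(20))$p.value < 0.05))
})

rates <- c(honest = mean(honest),
           optionalStopping = mean(optionalStopping),
           flexibleResponse = mean(flexibleResponse))
print(round(rates, 3))
```

For five independent tests at the 5% level, the chance of at least one false positive is 1 - 0.95^5, or roughly 23%, so the flexible-response rate should land in that region, while the honest analysis should stay close to 5%.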
Could this be solved? Of course it could – the solution is well-known. For a confirmatory analysis, you need to fix your hypothesis before the data collection and stick with it. Best with a pre-registered analysis plan. I once suggested this to a colleague from an empirical ecology group, and was told “Are you crazy? If we did this, our students would never finish their PhD – the original hypothesis hardly ever checks out” … any questions about whether there are issues in ecology?
Side note – I’m all for giving exploratory analyses more weight in science, see e.g. here, but exploratory analysis = being honest about the goal. Fishing != exploratory analysis!
The second issue I’m seeing is that there are widely accepted analysis strategies in ecology that are statistically unsound. The best example I have is the analysis chain of
- Perform AIC selection
- Present regression table of the AIC selected model
What few people realize is that, while AIC selection alone is useful, and regression tables alone are useful as well, the combination of an AIC selection with a subsequent regression table is problematic. Specifically, in combination, the p-values in the regression table will generally be incorrect, because they do not account for the earlier AIC selection (how could they? Your R command doesn’t know you did a selection). If you don’t believe me that this is a problem, try this:
# This example shows how AIC selection, followed by a conventional regression analysis of the selected model, massively inflates false positives. CC BY-NC-SA 4.0 Florian Hartig
library(MASS)  # provides stepAIC
dat = data.frame(matrix(runif(20000), ncol = 100))  # 200 observations of 100 random predictors
dat$y = rnorm(200)  # the response is pure noise, unrelated to any predictor
fullModel = lm(y ~ . , data = dat)
summary(fullModel)
# 2 predictors out of 100 significant (on average, we expect 5 of 100 to be significant)
selection = stepAIC(fullModel)
summary(selection)
# voila, 15 out of 28 (before 100) predictors significant – looks like we could have good fun to discuss / publish these results!
The full model has correct type I error rates of approximately 5%. Here’s the result after model selection – let me remind you that none of these variables truly has an effect on the response. I am pretty certain that I could get such an analysis into an ecology journal, writing a nice discussion about the ecological sense of each of these “effects”, and why our results differ from some previous studies etc. bla bla. This is why I don’t (and neither should you) do model selection for hypothesis-driven analyses!
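A single run could of course be a fluke, so here is a hedged sketch that repeats the same experiment many times and counts how often pure-noise predictors come out as significant, before and after selection. To keep the run time short it uses base R's `step()` instead of `MASS::stepAIC`, and a smaller problem (100 observations, 20 predictors, 50 replicates); these settings and the helper `countSignificant` are my own illustrative choices, not part of the original example.

```r
# Repeat the null simulation and compare the share of "significant" predictors
# in the full model with the share in the AIC-selected model.
# Problem size, seed and the helper name are illustrative assumptions.
set.seed(1)
nRep <- 50
n <- 100
p <- 20

# Count predictors (intercept excluded) with p < 0.05 and the number retained
countSignificant <- function(model) {
  pvals <- summary(model)$coefficients[-1, 4, drop = FALSE]
  c(significant = sum(pvals < 0.05), retained = length(pvals))
}

counts <- matrix(0, nrow = nRep, ncol = 4,
                 dimnames = list(NULL, c("fullSig", "fullN", "selSig", "selN")))
for (i in seq_len(nRep)) {
  dat <- data.frame(matrix(runif(n * p), ncol = p))
  dat$y <- rnorm(n)                        # response unrelated to all predictors
  fullModel <- lm(y ~ ., data = dat)
  selected <- step(fullModel, trace = 0)   # backward AIC selection, silent
  counts[i, ] <- c(countSignificant(fullModel), countSignificant(selected))
}

totals <- colSums(counts)
shareFull <- totals[["fullSig"]] / totals[["fullN"]]
shareSelected <- totals[["selSig"]] / totals[["selN"]]
print(round(c(fullModel = shareFull, afterSelection = shareSelected), 3))
```

The first number is the type I error rate of the full model. The second is the share of retained predictors that a reader of the post-selection regression table would see as significant, even though none of them has a true effect.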
Inappropriate statistical methods
Finally, as a third category, let’s come to statistical methods that are fundamentally flawed in the first place. I could name a whole list of issues off the top of my head, including
- Fitting power laws by log-log linear regression on size classes, which produces biased estimates and has significantly distorted efforts to test metabolic scaling theories (see an old post here).
- Regressions on beta diversity / community indices, which are notoriously unstable / dependent on other things; as well as regressions on network indices, which have the same problems. Lots of spurious results produced in these fields over the years. Incidentally, null models are not a panacea, although they help.
- And of course, there is a long list of papers that made good old plain mistakes in the analysis, whose correction completely changes the conclusions. Lisa Hülsmann and I have a technical comment forthcoming that will be discussed in a future post, but here is an old example.
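To illustrate the first item in this list, here is a small sketch that compares a log-log regression on size classes with the maximum-likelihood estimator for a continuous power law. It is only one possible setup, not the analysis from the linked post: the true exponent (2.5), the lower bound (1), the unit-width size classes, the sample size and the seed are all assumptions I made for illustration.

```r
# Compare two estimators of a power-law exponent on simulated data.
# True exponent, lower bound, bin width, sample size and seed are
# illustrative assumptions.
set.seed(7)
alpha <- 2.5
xmin <- 1
n <- 500
nRep <- 200

estimateBoth <- function() {
  # Inverse-CDF sampling from p(x) ~ x^(-alpha) for x >= xmin
  x <- xmin * runif(n)^(-1 / (alpha - 1))
  # Maximum-likelihood estimate of the exponent
  mle <- 1 + n / sum(log(x / xmin))
  # Size classes of width 1; regress log(count) on log(class midpoint),
  # dropping empty classes because log(0) is undefined
  counts <- tabulate(floor(x - xmin) + 1)
  mids <- xmin + seq_along(counts) - 0.5
  keep <- counts > 0
  slope <- coef(lm(log(counts[keep]) ~ log(mids[keep])))[[2]]
  c(mle = mle, binnedRegression = -slope)
}

estimates <- replicate(nRep, estimateBoth())
summaryTable <- rbind(mean = rowMeans(estimates),
                      meanAbsError = rowMeans(abs(estimates - alpha)))
print(round(summaryTable, 3))
```

If the regression on size classes were unproblematic, both columns should average close to the true exponent of 2.5 with similar errors.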
What’s the impact of this on ecological progress?
You might point out that we still have to show that all this has an impact on ecological progress. It’s a tricky task, because the question itself leaves a lot of wiggle room – what is the definition of progress in the first place, and how would you know that progress has been slowed down, as long as money comes in and papers get published?
I know it’s not 100% fair, but let me turn this question around: if it didn’t matter for the wider field whether what we report as scientific facts is correct or not, why go through all the painstaking work to collect data in the first place? By the same logic, I could write:
[irony on] Young people worry far too much about data collection, instead of just inventing data. I challenge you to name the most important data fabrication in ecological history. And make the case that it seriously held back scientific progress [irony off].
Moreover, I find it very hard to believe that there is no adverse effect of producing a lot of wrong results in any scientific field. In the best case, by creating noisy results, we’re less effective than we could be, burning money and slowing down a movement in the right direction. In the worst case, we could go in the wrong direction altogether, as might have happened recently in psychology.
But even if there were no effect on the progress of science (which I think there is), I’d argue in good old Greek tradition that using inappropriate tools and producing wrong results is simply not the right thing to do as a scientist. It’s undermining the ethics, aesthetics and professional practices of science, and regardless of whether it directly affects progress, I’m quite happy for any student that worries about using the appropriate tools!
ps: of course, one can worry about things that are not important. Using a t-test on non-normal data is often not a big issue. But to know this, you have to worry first, and then test it out!
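One way to "test it out" is a quick simulation. The sketch below checks the type I error rate of a two-sample t-test when both groups come from the same distribution, once for normal and once for strongly skewed (exponential) data. The sample size of 30 per group, the choice of distributions and the seed are assumptions for illustration, and the result only speaks for this particular setting.

```r
# Type I error of a two-sample t-test when both groups share one distribution.
# Sample size, distributions and seed are illustrative assumptions.
set.seed(123)
nSim <- 5000
n <- 30

# sampler is a function such as rnorm or rexp that draws n values
typeIError <- function(sampler) {
  mean(replicate(nSim, t.test(sampler(n), sampler(n))$p.value < 0.05))
}

rates <- c(normal = typeIError(rnorm), exponential = typeIError(rexp))
print(round(rates, 3))
```

If both rates come out close to the nominal 5%, the t-test is doing fine here. Smaller samples, unequal group sizes or one-sample tests on skewed data may behave differently, which is exactly why it is worth checking.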
pps: I’m not saying that stats is the only thing one has to worry about. Good theory / hypotheses are another one of course, as is clear thinking. But I think stats + experimental design is quite central to getting science right.
[edit 6.5.18] After writing this post, I became aware of the study “Wang et al. (2018) Irreproducible text‐book “knowledge”: The effects of color bands on zebra finch fitness”, which seems to show at least one example where a field maintains a wrong conclusion due to low power / researcher degrees of freedom / selective reporting, comparable to what’s going on in psychology.