Yes, statistical errors are slowing down scientific progress!

Over at Dynamic Ecology, Jeremy Fox argues that

Technical statistical mistakes are overrated; ecologists (especially students) worry too much about them. Individually and collectively, technical statistical mistakes hardly ever appreciably slow the progress of entire subfields or sub-subfields. And fixing them rarely meaningfully accelerates progress.

continuing with

Don’t agree? Try this exercise: name the most important purely technical statistical mistake in ecological history. And make the case that it seriously held back scientific progress.

I would argue that nothing could be further from the truth. It is no challenge at all to point out massive statistical problems that slow down progress in ecology. And quite apart from that, using inappropriate methods is simply “the wrong thing to do” for a scientist, so I very much hope that students worry about this topic. Let me give a few examples.

p-hacking and researcher degrees of freedom

Statistical errors need not be massive and obvious to have an impact on the wider field.

IF A LOT OF SMALL PEOPLE IN A LOT OF SMALL PLACES DO A LOT OF SMALL THINGS, THEY CAN CHANGE THE FACE OF THE WORLD (possibly an African proverb, but certainly graffiti on the Berlin Wall)

In recent years, there has been a widespread debate throughout the sciences about the reliability / replicability of scientific results (I blogged about this a few years back here and here, but there have been many new developments since; a recent collection of papers in PNAS provides a great, although somewhat broader, overview).

The statistical issue I’m referring to is the impact of analysis decisions like

  • Changing the hypotheses (predictor or response variables) during the analysis, e.g. trying out various combinations of predictors and response variables to see if the results are “improved” or what is “interesting”. This includes looking at the data before the analysis and deciding based on that which tests to run!
  • Making data collection dependent on results, e.g. collecting a bit more data if there seems to be an effect, or removing data if it seems “weird”
  • Trying out different statistical tests and using those that produce “better” (= more significant) results
  • etc. etc.

I think few people who are involved in teaching ecological statistics will dispute that these strategies, known as p-hacking, data-dredging, fishing and HARKing (hypothesizing after the results are known), are widespread in ecology, and a large body of research shows that they tend to have a substantial impact on the rate at which false positives are produced (see, e.g., Simmons et al., or the mind-boggling Brian Wansink story).
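To make the second item in the list above concrete, here is a minimal sketch of “collect a bit more data if there seems to be an effect”. It is only an illustration under assumptions of my own choosing: the data are pure noise, the test is a two-sided z-test with a known standard deviation of 1, and the looks at 10, 20, 30, 40 and 50 observations, as well as all function names, are hypothetical.

```python
import random
from statistics import NormalDist


def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, assuming a known standard deviation of 1."""
    z = sum(sample) / len(sample) ** 0.5  # sample mean divided by its standard error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(n_sims=4000, looks=(10, 20, 30, 40, 50), seed=42):
    """Compare one test at the final sample size with testing at every look."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(n_sims):
        # The true mean is 0, so every "significant" result is a false positive.
        data = [rng.gauss(0, 1) for _ in range(looks[-1])]
        if z_test_p(data) < 0.05:
            fixed_hits += 1
        # Optional stopping: declare an effect as soon as any look gives p < 0.05.
        if any(z_test_p(data[:n]) < 0.05 for n in looks):
            peeking_hits += 1
    return fixed_hits / n_sims, peeking_hits / n_sims


if __name__ == "__main__":
    fixed, peeking = false_positive_rates()
    print(f"fixed sample size: {fixed:.3f}, with peeking: {peeking:.3f}")
```

The fixed-sample rate should stay close to the nominal 5%, while the rate with peeking comes out clearly higher, even though each individual test is technically correct.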

Could this be solved? Of course it could; the solution is well known. For a confirmatory analysis, you need to fix your hypothesis before the data collection and stick with it, ideally with a pre-registered analysis plan. I once suggested this to a colleague from an empirical ecology group, and was told “Are you crazy? If we did this, our students would never finish their PhD – the original hypothesis hardly ever checks out” … any questions about whether there are issues in ecology?

Side note – I’m all for giving exploratory analyses more weight in science, see e.g. here, but exploratory analysis = being honest about the goal. Fishing != exploratory analysis!

Analysis strategies

The second issue I’m seeing is that there are widely accepted analysis strategies in ecology that are statistically unsound. The best example I have is the analysis chain of

  1. Perform AIC selection
  2. Present regression table of the AIC selected model

What few people realize is that, while AIC selection alone is useful, and regression tables alone are useful as well, the combination of an AIC selection with a subsequent regression table is problematic. Specifically, in combination, the p-values in the regression table will generally be incorrect, because they do not account for the earlier AIC selection (how could they? Your R command doesn’t know you did a selection). If you don’t believe me that this is a problem, try it in a simulation with predictors that have no effect on the response.
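One way to check this claim is a small simulation in which none of the predictors affects the response. The following is a minimal sketch in Python with NumPy (not R, and not the code behind the results discussed in this post). The sample size, the ten pure-noise predictors, backward elimination as the AIC selection procedure, the normal approximation used for the p-values, and all function names are assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist


def ols(X, y):
    """Ordinary least squares: coefficients, standard errors, AIC (up to a constant)."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = float(resid @ resid)
    se = np.sqrt(rss / (n - p) * np.diag(np.linalg.inv(X.T @ X)))
    aic = n * np.log(rss / n) + 2 * p
    return beta, se, aic


def p_values(beta, se):
    """Two-sided p-values; normal approximation to the t distribution (assumption)."""
    return np.array([2 * (1 - NormalDist().cdf(abs(b / s))) for b, s in zip(beta, se)])


def backward_aic(X, y):
    """Drop one predictor at a time while that lowers the AIC (column 0 = intercept)."""
    keep = list(range(X.shape[1]))
    best = ols(X[:, keep], y)[2]
    while len(keep) > 1:
        trials = [(ols(X[:, [c for c in keep if c != d]], y)[2], d) for d in keep[1:]]
        aic, drop = min(trials)
        if aic >= best:  # no single removal improves the AIC any further
            break
        best = aic
        keep.remove(drop)
    return keep


def simulate(n_sims=200, n=200, k=10, seed=1):
    """Share of coefficients with p < 0.05 in the full vs. the AIC-selected model."""
    rng = np.random.default_rng(seed)
    full_sig = full_total = sel_sig = sel_total = 0
    for _ in range(n_sims):
        X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
        y = rng.normal(size=n)  # the response is independent of every predictor
        beta, se, _ = ols(X, y)
        full_sig += int((p_values(beta, se)[1:] < 0.05).sum())
        full_total += k
        keep = backward_aic(X, y)
        if len(keep) > 1:  # regression table of the selected model
            beta, se, _ = ols(X[:, keep], y)
            sel_sig += int((p_values(beta, se)[1:] < 0.05).sum())
            sel_total += len(keep) - 1
    return full_sig / full_total, sel_sig / sel_total


if __name__ == "__main__":
    full_rate, selected_rate = simulate()
    print(f"full model: {full_rate:.3f}, after AIC selection: {selected_rate:.3f}")
```

The second number is the share of significant coefficients among those that survive selection, i.e. among the rows that would appear in the reported regression table. Since backward elimination only keeps predictors with a comparatively large test statistic, that share should come out well above the nominal 5%.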

The full model has correct type I error rates of approximately 5%. Here’s the result after model selection – let me remind you that none of these variables truly has an effect on the response. I am pretty certain that I could get such an analysis into an ecology journal, writing a nice discussion about the ecological sense of each of these “effects”, and why our results differ from some previous studies etc. bla bla. This is why I don’t (and neither should you) do model selection for hypothesis-driven analyses!

[Screenshot: regression table of the model after AIC selection]

Inappropriate statistical methods

Finally, as a third category, let’s come to statistical methods that are fundamentally flawed in the first place. I could name a whole list of issues off the top of my head, including

  • Fitting power laws by log-log linear regression on size classes, which produces biased estimates and has significantly distorted efforts to test metabolic scaling theories (see an old post here).
  • Regressions on beta diversity / community indices, which are notoriously unstable / dependent on other things; as well as regressions on network indices, which have the same problems. Lots of spurious results produced in these fields over the years. Incidentally, null models are not a panacea, although they help.
  • And of course, there is a long list of papers that made plain old mistakes in the analysis, whose correction completely changes the conclusions. Lisa Hülsmann and I have a technical comment forthcoming that will be discussed in a future post, but here is an old example.

What’s the impact of this on ecological progress?

You might point out that we still have to show that all this has an impact on ecological progress. It’s a tricky task, because the question itself leaves a lot of wiggle room – what is the definition of progress in the first place, and how would you know that progress has been slowed down, as long as money comes in and papers get published?

I know it’s not 100% fair, but let me turn this question around: if it didn’t matter for the wider field whether what we report as scientific facts is correct, why go through all the painstaking work to collect data in the first place? By the same logic, I could write:

[irony on] Young people worry far too much about data collection, instead of just inventing data. I challenge you to name the most important data fabrication in ecological history. And make the case that it seriously held back scientific progress [irony off].

Moreover, I find it very hard to believe that there is no adverse effect of producing a lot of wrong results in any scientific field. In the best case, by creating noisy results, we’re less effective than we could be, burning money and slowing down movement in the right direction. In the worst case, we could head in the wrong direction altogether, as may have happened recently in psychology.

But even if there were no effect on the progress of science (which I think there is), I’d argue in good old Greek tradition that using inappropriate tools and producing wrong results is simply not the right thing to do as a scientist. It undermines the ethics, aesthetics and professional practices of science, and regardless of whether it directly affects progress, I’m quite happy for any student who worries about using the appropriate tools!

ps: of course, one can worry about things that are not important. Using a t-test on non-normal data is often not a big issue. But to know this, you have to worry first, and then test it out!

pps: I’m not saying that stats is the only thing one has to worry about. Good theory / hypotheses are another one of course, as is clear thinking. But I think stats + experimental design is quite central to getting science right.

[edit 6.5.18] After writing this post, I became aware of the study “Wang et al. (2018) Irreproducible text‐book “knowledge”: The effects of color bands on zebra finch fitness”, which seems to show at least one example where a field maintained a wrong conclusion due to low power / researcher degrees of freedom / selective reporting, comparable to what’s going on in psychology.

 

14 thoughts on “Yes, statistical errors are slowing down scientific progress!”

  1. Florian, I agree!

    Jeremy’s point is difficult for various reasons.

    First, how do we know that ecology has indeed progressed, if a lot of it is (presumably or demonstrably) false? (Reference point: Ioannidis 2005 PLoS Medicine holds just the same for ecology as for medicine.)

    Second, as a case in point, I would argue that virtually all publications on nestedness in interaction networks are either wrong, confused or both. (They started by showing that nestedness exists, which is not at all surprising; then by making the point that nestedness is LARGER than expected, which would be surprising; and then by showing that nestedness is LESS than expected. All these papers still have to make the point that this IS actually ecologically relevant. And no end in sight.)

    Third, at some workshop, we discussed the importance of so-called “bullshit papers”, i.e. publications that are wrong, but sparked the debate about a topic (think the retracted micro-plastics paper in Science 2017). One senior editor made the point that such BS-papers are great, because they focus science. I argued that they are a waste of time, forcing people to disprove something that was never correct in the first place. This time is lost to actual “progress”, whatever that is.

    Fourth, and closing the circle: what is progress in ecology? Having more data on some case studies? Or, as Jeremy has argued in a different blog post (and indeed his “Zombie theories” paper in TREE), is it progress when untenable hypotheses are purged from our ecological canon? Regrettably, as Jeremy pointed out, we are not good at purging rotten hypotheses. So maybe ecology doesn’t progress. But if a hypothesis is eventually abandoned, I believe that a lot of reproducible statistical analyses and low r2-values probably have contributed to its demise.

    P.S.: There was at least one point missing in your P.S.s: Quality of data. The recent rage about remote-sensing data and eDNA shows how easily data can move a research agenda into what is possible to quantify, not what is sensible to quantify. “No carabid beetle diversity from remote sensing, no fungal diversity from eDNA (due to contamination with historic DNA)? What the heck: just focus on what we can quantify instead, no matter how trivial!”

    • Hi Carsten,

      yes, I agree with all that, including the point that sparking a debate is not necessarily equal to progress. A similar, albeit somewhat more complex example than micro-plastics would be Neutral Theory.

      I’d put a slightly different emphasis on the RS and eDNA point though. I guess it’s inevitable that the emergence of any new (big) data source will trigger a bunch of correlation studies of questionable use (think about general sequencing data), but I am still confident that such new methods that can acquire large datasets at reasonable costs will be immensely useful in the long run, once we have figured out how to use them in a sensible way.

  2. Pingback: Os perigos do abuso da estatística e da modelagem matemática por ecólogos – Sobrevivendo na Ciência

  3. Hi Florian, it might surprise you, but I agree with most of what you said. In fact I’ve railed against AIC model selection, throwing a ton of variables into a model without thinking, and while it’s more Jeremy’s schtick I agree that p-hacking is a problem.

    But I think what I have gotten at with statistical machismo (like the phrase or not) and what I read in Jeremy’s blog post is a bit different.

    I think of p-hacking and model selection as inferential errors (drawing bad conclusions when the statistics are done technically correctly), whereas not using phylogenetic regression would be a statistical error (a violation of assumptions of the statistical model). But then the issue becomes that statistical errors almost always occur (there is almost always a degree of violation of assumptions), so we need to become better at figuring out when those violations are consequential or not (sometimes they are, sometimes they are not).

    But I 100% agree with you that model selection and p-hacking are bad for ecology. Whether they have caused us to go down the wrong path or just wasted a lot of resources I don’t know. I lean to the latter.

    • Hi Brian, no, not surprised at all, but delighted everyone is on the same page regarding AIC selection.

      Regarding your comments, also on https://dynamicecology.wordpress.com/2018/05/02/what-are-the-most-important-technical-statistical-mistakes-in-ecological-history-and-were-they-all-that-important/#comment-70111 where you say that Jeremy is not after the small fish, but after errors that

      (b) misdirect the field and cause many authors not on the original paper to pursue a question they otherwise wouldn’t have.

      (c) not just wasted time but caused the overall consensus of the field to be wrong (not inefficient but wrong).

      –> I guess the problem with discussing this is agreeing on what counts as “misdirect”, or “wrong”? Not a statistical example, but did UNTB misdirect the field? What about May 1973? Or, on the more statistical side, did network or beta analyses misdirect the field? Did the studies about scaling laws misdirect the field?

      I think in each of these cases, the early conclusions of these fields can be considered “wrong”. The problem is that, as Carsten points out, once an idea becomes influential, stuff gets refined, and in the end we inevitably see the original paper as useful, regardless of whether its original conclusions were correct or not.

      In that reading, OK, I can agree with your points, but then it becomes very hard to come up with an example for b/c, because by definition either we don’t yet know that some important insight was wrong, or if we know, it’s also fine because now we know, and it helped the field to move forward.

      Maybe we can rejoin by saying it’s relieving to see that science is able to make progress, despite making a lot of errors (not only statistical). I think a similar argument was made recently in “Scientific progress despite irreproducibility: A seeming paradox” http://www.pnas.org/content/115/11/2632

      I guess what I reacted to most was that the original post by Jeremy could be read as a defense of those practices, which was probably not intended.

      • Interesting question about whether theoretical results (like neutral theory or May’s complexity-creates-instability results) can misdirect a field. I would say yes, but as you say you can also argue that they advanced the field by being wrong.

        I guess I read Jeremy’s post as primarily asking whether major errors are more likely to be due to statistical mistakes or other types of mistakes.

        But I definitely agree that “it’s relieving to see that science is able to make progress, despite making a lot of errors (not only statistical)”

  4. Very interesting post!

    The Analysis strategies topic and Marco’s comment on Dynamic Ecology (https://goo.gl/6vwM2h) got my attention back to model selection procedures. Here, if I understood correctly, you pose that combining model selection based on information criteria (IC), then checking for statistical significance in the selected model would be a mistake. But I have seen this approach so many times. One can manually conduct a forward or backward selection, use an IC to compare biologically relevant models and present a plot and a summary table from the selected model. Then, what would be the best approach? As a graduate student in ecology, I’ve been reading Alain Zuur and Ben Bolker’s books. Do you have any other references on model selection?

    Thanks in advance.

    • Yes, exactly, it’s done all the time, and I know it’s presented like this by Zuur. I’m not sure, is the Bolker book advocating this as well?

      All I can say is that the numbers don’t lie – you can run the simulation yourself, and anyway, it’s a well known problem in statistics. Search for “post selection inference”.

      What to do? I think this is more a topic for a post of its own. In brief, 3 comments:

      1) In many cases, MS is not necessary at all. People run MS despite having enough data to support their full model. Only move away from the full model if you have to.

      2) In statistics, you will find ideas on how to correctly calculate p-values after MS under “post selection inference” … there are some options, but afaik, they are not readily available in R

      3) I personally advocate the use of shrinkage estimators (lasso / ridge) in a Bayesian framework, because of their better error properties. The Bayesian framework is useful because shrinkage estimators also have problems with calculating p-values and CIs, whereas a Bayesian posterior is relatively straightforward to obtain and interpret.

  5. Pingback: Friday links: overly honest faculty job ads, Jeremy’s “DJ name” revealed, and more | Dynamic Ecology
