Dear fellow ecologists, dear ladies and gentlemen,

it’s an honour to be given the opportunity to talk to you about something I deeply care about. As you will see, it is not a scientific talk, but rather reflections on the current state of ecology as a science, and more specifically, a science to be taken seriously by policy. As the title of the talk suggests, my impression is that “applied ecology” is a much larger body of ecological research than “academic ecology” or “fundamental ecology”. And this “applied ecology”, in my view, does not build on ecological knowledge, but only pretends to do so. It is thus a case of the tail wagging the dog.

Let me start with two anecdotes, one COVID-related, one about oil. At the rise of the B.1.1.7 variant of COVID (now called “alpha”), early indications were that it would transmit 70% better, but not increase mortality. The presenter of the BBC’s “More or Less”, Tim Harford, put the following statement to his interviewee: “Thank god it is only 70% more transmissible, not 70% more deadly!” She put him right: “Actually, 70% more transmissible will lead to far more deaths than 70% more deadly, as the disease will infect many more people; and a higher exponential increase in infections multiplied by a constant proportion of deaths is worse than a lower exponential increase multiplied by a higher proportion of deaths.” “Mortality” sounds worse than “transmission”, but it is the effects that count, not the word.

The point of this anecdote: our intuition may easily fool us; we have to do the maths.
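And the maths here is a quick back-of-envelope calculation. A minimal sketch in R (all numbers are hypothetical, chosen only for illustration, not taken from the interview):

```r
# Hypothetical comparison: a variant 70% more transmissible vs. one 70% more
# deadly, starting from 1000 infections over 10 generations of spread.
R0    <- 1.1   # assumed reproduction number of the baseline variant
ifr   <- 0.01  # assumed infection fatality rate
gens  <- 10
cases <- 1000

deaths_baseline          <- cases * R0^gens * ifr
deaths_70pct_deadlier    <- cases * R0^gens * (1.7 * ifr)
deaths_70pct_more_transm <- cases * (1.7 * R0)^gens * ifr

# the more transmissible variant causes far more deaths than the deadlier one
round(c(deaths_baseline, deaths_70pct_deadlier, deaths_70pct_more_transm))
```

With these made-up numbers, the 70% deadlier variant roughly doubles the deaths, while the 70% more transmissible one multiplies them by two orders of magnitude: the exponent beats the factor.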

The second anecdote is about “peak oil”. Back in 1956, when a geologist called M. King Hubbert coined the term “peak oil”, this was about the rate of oil production in an oil-dependent world, not about greenhouse gas emissions. Having analysed oil production data, he predicted its peak for the 1970s. The fundamental flaw of his concept became apparent in the following years: oil production and even oil exploration was heavily demand-driven; oil field exploration was simply too expensive at the low oil prices of the 1950s, but picked up dramatically during the oil crisis of the 1970s. The higher oil price made new extraction technology affordable, such as deep-sea drilling or oil sands. The moral: it was an economic pattern that Hubbert wrongly attributed to geology.

The motivation for these anecdotes will hopefully become clear in due course.

The leading question behind this talk is “Is Ecological Science fit for policy?” One difficulty in answering this seemingly simple question lies in the huge variability of what we refer to as “Ecology”. As insiders of this discipline, we, all of us, are likely to be aware of which sub-fields are more developed and which lack scientific rigour; which are very close to application, even to engineering; which are aloof and theoretical; which are mainstream in our teaching, and which are considered orchidean, even esoteric.

Looking from the outside, be it as an environmental manager in a big company or a governmental body tasked with countering the loss of biodiversity, such internal rifts and nuances are much less apparent. Ecology, as a science, benefits in the perception of the outside world from those scientific disciplines that have been part of policy for many decades: law, economics, medicine, engineering, history. Outsider as I am to all of these, I still suspect they harbour a similar diversity of schools, thoughts, ideas and approaches, and dare I say it, competences, as our own field of Ecology. However, the pressure of demand from policy has forged a subliminal understanding, within each of these fields, that in order to retain influence over political and societal decisions, critical and dissenting voices must be channelled to and reflected within disciplinary conferences and workshops; they must not be allowed to taint the glossy image that each of these disciplines presents of itself to the world beyond.

I am no historian, no economist and also no sociologist. So these thoughts are largely speculative, although formed by discussions with colleagues from exactly these “established” policy-forming fields of science. I was amazed when a relative of mine, who studied Mechanical Engineering, literally bit his tongue when I criticised an engineering blunder made by one of his colleagues. He agreed, heavy-hearted and only after several glasses of wine, but his upbringing in his profession triggered immediate defence mechanisms against any such accusation. My criticism, which only repeated points made in a newspaper article I had read, touched his feeling of belonging to the accused profession. He identifies very strongly with Mechanical Engineering, and hence any critique of his profession, justified or not, is a criticism of himself.

Again, I cannot speak for you, dear audience, but I personally hardly feel any defence reflex when I hear a colleague being taken to task by the media. More often than not, I am actually rather pleased when another overblown statement by a fellow ecologist is shredded by skeptical journalists and meticulous investigation by a layperson. It may be part of my misanthropic character, but I rather suspect that scientists by and large are individualistic and quirky and hence do not easily feel a strong sense of companionship with somebody, just because that somebody happens to have studied the same subject. So what I guess I am saying is: ecologists don’t make for a tight-knit bunch such as legal scholars, engineers or medical doctors.

Does that matter? To see that it does, imagine you were asked to review a report submitted to a federal legislative body on, say, the effect of insect decline on agriculture, forestry and natural ecosystems in your country. This topic is so broad that many of us would feel qualified to comment at least on parts of such a report. Now imagine further that the report suggests that a future with an 80% loss of insect biomass would actually be beneficial to agriculture, leave forestry largely unaffected, but be catastrophic for natural ecosystems. What would your comment be, and why?

I suggest that your comments would hugely depend on your own background in ecology and whether you know the background of the scientists who wrote the report. Your background is important because it will show whether you look at insect decline as a loss of pollinators, or as a loss of crop pests. And their background is important, as it allows you to better agree or disagree with their way of thinking: “Ah well, Christian Ammer always looked at it this way.” If my surmise is correct, we base our acceptance of a scientific report not on its content, but on how well it chimes with our view of the world. And since every ecologist has a different outlook, one cannot expect them to agree.

Scientific consistency is important at the science-policy interface, because it represents a check against **advocacy bias**. Why would any scientist agree to write a policy report? It seems an obvious waste of time! There is a very good reason why ecologists *would* in fact be willing to provide an answer to such a vaguely scoped question: the lure of influence! The chance to change the dismal state of our natural world for the better! The opportunity to bring to the forefront of legislation a topic so important to the biological world as climate change or loss of biodiversity or eco-farming or groundwater pollution or … you name it.

Advocacy is common to all walks of science. It becomes problematic in a judicial sense when the expert witness, and this is what such a report stands for, is not reporting impartially. In court, any witness with a clear grudge against the accused would clearly lack credibility. Similarly, advocate ecologists lose credibility as scientific witnesses. I am not suggesting that we should not have an opinion, but we should have a system in place that provides checks and balances for whatever opinion we may have *when making scientific statements*. And that is what legal scholars, engineers and even economists have: a way to make a statement credible to their community. I imply, and state explicitly, that we ecologists do not have such a system. *My* statement will rub you up the wrong way, and I would not defend *your* statement to my grandmother! And rightly so, I guess.

The above-mentioned tightly-knit disciplines sing from the same hymn sheet. They may disagree on many aspects, but they have a canonical training, so that every scholar knows the way the others in the field think, which arguments have weight, which methods are reliable. Legal scholars may disagree, but they are convinced that their profession is sound: *why*? Because they *know how every other lawyer thinks*. Their mindset is similar; their philosophy and logic are trained and tested during their education to follow the same lines and standards!

Returning to the insect-decline report, I would not trust any of my colleagues’ statements *unless* I know how they think and work. In Ecology, some such statements are “expert guesses”, others are based on lengthy meta-analyses. Some consider “unnatural”, genetic-engineering alternatives, others mix in their love of six-legged creatures. Some will regard the problem abstractly, theoretically, mathematically, others will rely on experimentation. But across this diversity of justifiable approaches, which are sound enough to make a sweeping statement? Does a genetically-minded entomologist accept the approach of a theoretician? The experimentalist that of a conservation biologist? I suggest they would not. I believe we have quite a bit of faith in *our* approach, but less in *theirs*. We do not understand *their* approach enough, and we weren’t forced in our training to actually understand it. I claim that we do not know which approach to take for a given problem, which approach is the most likely to yield the correct answer. We weren’t taught the tools of our own trade! And if we were, we still would not employ them, but hope that *our* approach would be qualitatively good enough.

When we face global problems, who knows what will be the best approach? There are issues of scaling-up local processes, which we can only tackle theoretically. But there are deficits in our process understanding which we can resolve only through experimentation. It is no fluke of human behaviour that we have sub-disciplines in virtually every field of science: it is a necessity.

At the same time, any scientific field, including Ecology, must have coherence. If it does not, it remains a loose assortment of ideas and methods, like medieval medicine: stubbornly sticking to traditional practices such as blood-letting despite ample evidence of their harmfulness and in the face of substantial progress in physiological understanding. Such science not only performs worse than it could, it also absolves malpractice.

In medicine and physics, coherence came with undeniable evidence of superiority of correct science. Pasteur’s sterilisation “simply worked” for preventing wine from turning sour, irrespective of whether winemakers shuddered at the idea of cooking their product. When you lose 30% of your business to germs, you happily embrace some short heating under pressure, and hide this practice under the term “pasteurising”.

My view of Ecology is that of an incoherent discipline. While we have ecology textbooks, particularly Begon/Harper/Townsend, Krebs, Wittig/Streit and Nentwig/Bacher/Brandl, I dare claim that few in this room were actually forced to understand them in every detail. Ecological teaching is still heavily driven by anecdote and personal experience. Theoretical ecology and ecosystem ecology in particular are given short shrift in most universities, because they don’t seem to contribute to progressing Ecology. In historical analogy, the Maxwell equations of electromagnetism were ignored by experimental physicists because they were difficult. Only because they *worked magnificently* did they become staple fare for any modern physicist. Being right is a powerful argument.

My dire reading of the state of Ecology is that we work in a specific way, and not in another, because that’s what we were trained in. Epistemologically, that’s a disaster! If we want to solve a problem, we should use the best possible method, not the most convenient. That, in turn, requires that Ecology students be taught the full breadth of topics, approaches, ecosystems and organisms: handling mice in the field, identifying spiders, simulating partial differential equations *and* measuring marine carbon fluxes. This is not the place, and I am not the right person, to develop the elements of such an Ecology curriculum. But I strongly believe that we need a dramatic overhaul, a canonical curriculum of Ecology, based on concepts, skills and techniques that generalise, not on the individual competences of the lecturer.

I think that such a canonical curriculum is needed because of a side-effect, which motivated this talk today. We would start to understand each other! If we all knew how to carry out behavioural experiments with hoverflies, we would be able to evaluate the importance, or lack thereof, of this approach. If we had all learned how to take field measurements of decomposition rates, we would be better positioned to assess models of ecosystem carbon balance. We might not like doing it, but in the interest of a comprehensive understanding of natural processes we ought to do what is best for science. How shall we ever make progress if we turn around at every wall that comes in our path? Think of the bias we produce by accepting coarse approximations because we can’t be bothered to do it right. The catch-phrase here is “Better some poor estimate than none at all.” If this statement were correct, and I doubt it, it would only be correct transiently; it implies that we have to strive for better estimates, not give up and run with poor ones.

I hear you objecting. But your arguments are advocative, not epistemological. That is, you argue for Ecology to have an impact, not for Ecology to know more.

So, that was my analysis. What about the therapy?

To proceed with making ecology an applicable science, four elements are required:

1. Define a set of principles, laws, theories, fundamentals.

2. Define a set of procedures, models, rules to apply these principles to specific situations.

3. Evaluate the track record.

4. Identify the limits of competence and possible system understanding.

Element 1: Define a set of principles, laws, theories, fundamentals.

Ecological systems differ hugely. Arthropod abundance in tropical forests is very different from particle absorption in urban parks, yet both must be investigated from the same scientific stance. What are the “fundamentals” that allow ecologists to make educated guesses about either? How can a Singapore-trained ecologist formulate expectations about how the marine ecosystem of the Baltic functions? While arguably Ecology has no “laws”, we are not theoretically ignorant either. Ecological systems, with very few exceptions, build on energy captured by plants, consumed by herbivores (if we allow fungi to be included in this category as well), which in turn are consumed by higher-order consumers. More energy input can only be translated into more biomass at all these trophic levels if plants are not limited by nutrients. Such constraints exist at each level, and they are heavily physiological, not species-specific. Now, I cannot claim to have developed such a set of principles, but others have contributed substantially to it, and whether you pick up the book by Mark Vellend or Michel Loreau, the key here is to see the fundamental ingredients of ecological theory, and not be sidelined by the many cases where we find these principles do not yield the full picture. My point is that we need to know what our building blocks are before we build an expectation. Of course, these fundamentals are also fundamental to teaching Ecology.

So, first element: theoretical principles. But what do we do with them? The second element is thus to “**identify a set of rules**, procedures, a workflow of how to approach an ecological problem by applying the building blocks”. An engineer building a bridge is fully aware of a long checklist of engineering rules: statics, material engineering, computing resonance frequencies, calculating torsion, wear and tear, maintenance intervals, and, of course, legal requirements. What is on the checklist of ecology?

Any approach to a problem in ecology, fundamental or applied, can be expected to start with a few reasonable basics: what are the main compartments of the system? What are the main fluxes of energy, water, carbon? What are the main constraints for plant and animal population growth, and what are the limiting resources for the subsystem of interest? In our human-dominated world, I deem it reasonable to ask which of these pools and fluxes is most strongly affected by human activity, and how our input and harvesting scale relative to natural processes. Probably we may want to elucidate whether individual species have disproportionate abundance, and whether there is commonality in traits within each relevant community. The spatial boundary of the system of interest should be delimited, and the temporal scale of dominant fluxes must be discovered. All that is probably what many ecologists do anyway, without giving it much thought. Doing it a bit more formally, we may avoid barking up the wrong tree, such as identifying hundreds of tree species in a tropical forest when the leading process of change is fire.

So, theoretical principles, and rules for addressing ecological problems. But will this lead to better application? The third element is thus indeed to evaluate our recommendations, to **build a track record**.

The technical term here is “to build an evidence base”. Whatever our ecological forebears and we ourselves have suggested in ecological applications needs to be evaluated. Did it work out? Were suggestions revised, adapted, withdrawn? Did the system respond in roughly the way we anticipated, or did some element of surprise render all our suggestions void?

Building such an evidence base has substantial scientific value, as it allows us to compare our guesses (call them “predictive hypotheses”, if you prefer) with reality. When did we get it right? Which element was missing from our step-2 checklist? Were first-step fundamentals violated? So we can learn from our guesses, particularly those that went wrong.

And we can also show off those that went right! We can be the bridge architect who can point to reference works across rivers and canyons. “Nothing is as sexy as success,” as the saying goes, and surely we all want Ecology to be sexy!

Really, evaluating evidence is a no-brainer, an obvious win for ecological science and its application.

But we must also **know when to stop**! Having identified a set of principles, defined a checklist of rules, and evaluated their usefulness, we must also learn when to shut up. Our knowledge, our system understanding, can only take us so far, given the enormous complexity of virtually any ecological system. It would be unreasonable to expect detailed quantitative long-term predictions for specific species, for example. When multiple drivers vie for influence over a specific system, how can we possibly predict which way evolution and randomness will move it? Delimiting the domain of our knowledge, in time, in space, in detail, is crucial for credible applications. If we pretended to know what the central European landscape will look like in 2100, as the scientific literature sometimes seems to suggest, we should consider ourselves clairvoyants, not scientists. No economist is expected to know what the market will be like in 20 years, so why should we strive for the impossible? It is not a sign of weakness to admit unpredictability, or ignorance, but an act of credibility.

So, there you have it: I believe that ecology is not yet ripe for policy advice on a big scale:

1. We have no common ground to keep in check the advocacy bias of scientific statements at the policy interface. At present, we cannot reliably distinguish between an advocate and an impartial witness, and the same person cannot be both.

2. We have no track record showing Ecology-led advice to be superior to “common sense”. I believe it is, but we need the facts to show for it.

3. Our discipline, Ecology, lacks coherence due to a lack of a canonical academic curriculum.

Beyond provocation, I hope this talk has hinted at some possible ways forward. I firmly believe that we need to provide the next generation of ecologists with a much more comprehensive education, one that teaches fundamental principles, application rules, evidence assessment and domain limitations. A curriculum that allows ecologists to speak a common language irrespective of whether they studied in Kiel or Konstanz, in Kyiv or Kyoto, and whether their main interest lies in carbon sequestration or carabids.

I hope that during this talk I managed to step on some toes; my aim was to challenge the current (applied and fundamental) ecological *modus operandi*. Thank you very much for enduring my opinionated statements, particularly in this very uncommon and hopefully never-to-be-repeated online experience!

Featured image by Hebrew Matio, CC BY-SA 4.0 via Wikimedia Commons

*In October 2021, Paul Ehrlich wrote a “Correspondence” in Nature with which I found myself sympathising to a surprisingly large extent. But the last sentence wiped out all agreement, as it revealed a fundamentally different view of the role and “duty” of “scientists”. I wrote this blog post at that time, and stashed it away. Now, a year later, with an actual war raging in a country close by, Ehrlich’s rhetoric seems more misplaced than ever. Here’s what I wrote then.*

**“Scientists are in a war for the future of humanity: they must get off of their peacetime footing.” Paul Ehrlich in Nature (https://doi.org/10.1038/d41586-021-02751-9)**

The elephant in the room of the accelerating loss of natural habitats and its knock-on effects for humans and nature is human overpopulation. At current population sizes, and even less at future ones, our earth cannot provide the required resources – arable land, water, fish stocks, energy, housing. We may quibble over how many people can be sustained – 8 billion, 10 billion, 3 billion – but we certainly cannot argue (with scientific validity) that we can ignore human overpopulation.

Paul Ehrlich, a long-time conservation activist, thus rightly chides the lack of this essential factor of the “food nexus” in current reports, e.g. that of the Scientific Group for the UN Food Systems Summit 2021. His short piece in *Nature* is correct, I think – except for the last sentence, which is quoted above.

The war metaphor, a common cry in US politics and culture, is as inappropriate here as in most other applications outside, well, actual war. In this case, the term “scientist” is also extremely unfortunate.

So, who are the “scientists” who “are at war”? And with whom? Do scientists currently fight some group of people aiming to end the human species? Who are they? As a scientist myself, I have plenty of “frontiers” I keep “battling” at: getting better data to test a theory, finding more appropriate ways to do an analysis, getting that paper past the thick-headed reviewers; but I am not “at war”. And I shouldn’t be, and neither should Mr Ehrlich: war is a horrible thing!

Our world is going down the drain in fast-forward. The destruction of our environment, which we cause and witness, is shocking. But the people causing the destruction are us! The same people have to be part of the solution, unless you fancy a supervillain-style world-destroying collapse. We may have to convince ourselves that using more than is available is stupid, understand that egoistic and greedy behaviour is unacceptable, and realise that “big business” is almost by definition uninterested in humanity. But this is not a war. Any “we against them” attitude is at best trash-talk; more likely, it creates the very trenches (sic!) we need to get out of.

And what does this have to do with scientists? Does Mr Ehrlich think that they have the moral high ground? That they are intellectual super-beings? That they have any super-democratic right to define what is right and wrong? Sure, ecologists in particular, including Mr Ehrlich and myself, have the experience and the data to quantify the demise of our planet. But that we (humans) do not turn such knowledge into action has nothing to do with science (or scientists), and everything to do with (a) the unwillingness of the Global North to make sacrifices until the very last minute, if ever, and (b) the powerlessness of the Global South to make a difference, even if they wanted to. (Admittedly, I am glossing over a few more points, which are also not related to science.)

War-talk will certainly not put overpopulation on anybody’s agenda. And pitting scientists against a strawman-army of anti-humanists won’t either. Pointing out, time and again, that overpopulation is, with overconsumption, the ultimate cause of environmental destruction, possibly, just possibly, may. I am with you, Mr Ehrlich, up to and excluding the last sentence of your piece.

A well-known issue with mixed models is that the df lost for a random effect (RE) depend on the fitted RE variance, which controls the freedom (or conversely, the shrinkage) of the RE estimates. As a consequence, the **df lost to a RE are not fixed a priori**, but have to be estimated after fitting the model (adaptive shrinkage). One could add that, on an even more fundamental level, the use of df implies that the null distribution is fixed (e.g. chi-squared) in the first place, which is also not entirely true in a non-asymptotic case.

As a consequence, ANOVA and AIC functions with naive df calculations should not be used for selecting on variance components of mixed models, and in principle the problem can permeate to the fixed effects as well (if those change the RE df). Unfortunately, the options to calculate corrected df in R are relatively limited – the most common choice is the **lmerTest package**, which uses the **Satterthwaite approximation**, but this only works for linear mixed models.
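For the linear case, using the Satterthwaite correction is as simple as swapping the fitting function; a short sketch using lme4’s built-in sleepstudy data:

```r
library(lmerTest)  # masks lme4::lmer with a version that provides corrected df

# linear mixed model on the sleepstudy example data shipped with lme4
m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

# summary() and anova() now report Satterthwaite-approximated df and p-values
summary(m)
anova(m)
```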

An alternative to a parametric test (which requires df) is a simulated LRT based on a parametric bootstrap. The idea is very simple:

- As for any LRT, we have two models, M0 nested in M1, H0: M0 is correct, and we want to use the test statistic log LR(M1/M0).
- However, instead of trying to get a parametric expectation for the test statistic under H0, we simply simulate new response data under M0 (parametric bootstrap), refit M0 and M1 on these data, and calculate the LRs. Based on these, we calculate p-values etc. in the usual way.
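By hand, the procedure above might look like the following sketch (`boot_lrt` is a hypothetical helper; it assumes both models are merMod objects so that `simulate()` and `refit()` work – handling mixed glm/glmer comparisons is exactly what DHARMa’s wrappers take care of):

```r
library(lme4)

# parametric bootstrap LRT for m0 (null model) nested in m1 (alternative)
boot_lrt <- function(m0, m1, nSim = 250) {
  obsLR <- as.numeric(logLik(m1) - logLik(m0))  # observed log likelihood ratio
  simLR <- replicate(nSim, {
    ySim <- simulate(m0)[[1]]  # new response data under H0
    as.numeric(logLik(refit(m1, ySim)) - logLik(refit(m0, ySim)))
  })
  # p-value: proportion of simulated LRs at least as large as the observed one
  mean(simLR >= obsLR)
}
```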

In principle, this should give you correct p-values for any LRT / ANOVA calculation that compares (nested) model alternatives, regardless of whether those nested alternatives differ in fixed or random effects. I had often used this trick by hand, and it occurred to me that it would be very simple and useful to implement it in DHARMa, because DHARMa already implements wrappers for simulating from and refitting a wide range of GLMM models in R.

OK, let’s see how this works in practice: here’s the example for the function in the DHARMa package. Note that we simulate data with a relatively strong RE.

```
library(DHARMa)
library(lme4)

# create test data with a relatively strong random effect
set.seed(123)
dat <- createData(sampleSize = 200, randomEffectVariance = 1)

# define null and alternative model (they should be nested)
m1 <- glmer(observedResponse ~ Environment1 + (1|group), data = dat, family = "poisson")
m0 <- glm(observedResponse ~ Environment1, data = dat, family = "poisson")

# run the simulated LRT - n should be increased to at least 250 for a real study
out <- simulateLRT(m0, m1, n = 10)
```

The result that we get is the simulated p-value of the LRT.

We also get (by default; this can be switched off) a graphical representation of the simulated null distribution:

Update March 23, 2022: a more concise / slightly modified version of these thoughts was published as a letter in TREE, co-authored with Fred Barraquand.

Criticism of p-values has been rampant in recent years, as were predictions of their imminent demise. When looking closer, most of these critiques are actually not sparked by the definition of the p-value (Fisher, 1925) as such, but rather by the NHST framework (Neyman and Pearson, 1933; Neyman, 1950), which introduces a cut-off (significance level alpha) to transform the p-value into a binary decision (significant / n.s.) while trading off Type I/II error rates in the process.

Just to be clear: I fully agree with many of these critiques, in particular that a fixed threshold is somewhat arbitrary, favours p-hacking and loses information compared to continuous measures of evidence for / against the null. I also agree that p-values are too dominant in the presentation and discussion of scientific results, compared to other indicators, for example effect sizes and CIs. On the other hand, having an operational, objective framework to decide whether the data point towards an effect or not is important well beyond science (think funding, regulation and political decisions), and if the NHST framework is to be abandoned for that purpose, what should replace it?

In a recently published paper in TREE, Muff et al. recapitulate these issues. They also mention established alternatives to NHST, such as information criteria (IC), confidence intervals (CI) or the Bayes factor (BF); stating, however, that:

[…] when these alternatives are used to make binary decisions, for example regarding the inclusion of variables in model selection, when checking whether the null effect lies in a CI, or when employing a certain threshold of a Bayes factor, we are not doing anything different from an NHST. It can, for example, be shown that model selection based on the AIC criterion can be converted into P-value limits (e.g., [2]), and even Bayes factors have an approximate equivalent in terms of P-values [4,25,26]. Finally, checking whether a certain value (often 0) lies outside the CI is equivalent to checking the P-value limit, like P < 0.05 if the 95% CI is used, for example.
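The first of these equivalences is easy to make concrete. For two nested models differing by one parameter, AIC = −2 logL + 2k prefers the larger model exactly when the LRT statistic 2(logL1 − logL0) exceeds 2, which under the asymptotic chi-squared(1) null corresponds to a fixed p-value limit:

```r
# significance level implied by AIC selection with one extra parameter:
# AIC prefers M1 iff the LRT statistic > 2, i.e. iff p < 1 - pchisq(2, 1)
1 - pchisq(2, df = 1)  # ≈ 0.157
```

So selecting by AIC between two such models amounts to an NHST at roughly alpha = 0.157.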

This statement seemed somewhat odd, because although it is of course possible to define thresholds on the BF, there is no need to do so. Kass & Raftery (1995), the classical citation for the BF, suggest that it should be interpreted as a continuous measure of evidence. Here is a snapshot from the paper:

Thus, if we want a measure of evidence in favour of H1, we could just use the BF. Muff et al., however, seem to implicitly dismiss the use of the BF by giving a different recommendation. They suggest to “stop using the term ‘statistical significance’” in conjunction with the p-value, and “replace it with a gradual notion of evidence” based on the p-value. What they mean by that is summarised in their Fig 1 (reproduced below).

They further recommend that this idea should be directly translated into text, as can be seen in a reproduction of Table 1 from their paper below (Fig. 3), which shows how statistical results should be reported.

The paper gives no explicit explanation for why they think this option is preferable to the more statistically established BF. What I read between the lines, however, is that they think that convincing people to use the BF would be too large a step, and we thus need something simple that is readily available and builds on what people know. For example, they write:

There is ample agreement, maybe the lowest common denominator of the whole discussion, that we should retire statistical significance by eliminating binary decision making from most scientific papers [12]. Such a transition will take time, but the show must go on today and we urgently need simple and safe ways to bypass the current state of disorientation.

I do understand the argument that using the p-value to calculate a continuous measure of evidence would be far more popular in the E&E community than using the BF, because the p-value is what people are used to and what is reported by the software they use. Still, this presupposes that all the information necessary to quantify the evidence provided by a study is contained in the p-value, which is far from obvious. We know, for example, that effect size alone is not sufficient to quantify the evidence for an effect, so how can we be sure that the p-value alone will suffice?

To approach the question of whether a p-value can be sensibly translated into evidence, we first have to consider what we mean by the word “evidence”. After all, if by evidence we simply mean some monotonically increasing function of 1-p, the proposal is trivially true. In this case, Fig. 1 / Table 2 would just amount to look-up tables that translate the p-value into words and vice versa, in the same way that we often display a * in regression tables.

Unfortunately, Muff et al. offer no formal definition of what they mean by “evidence”, other than saying that “evidence is a very intuitive concept”. Nevertheless, I think we can reject the interpretation above. First of all, a mere verbal translation of the numeric value of p would be highly redundant, as numeric p-values are presented in the text as well; and secondly, it is clear that the words “weak evidence” and “strong evidence” will be interpreted by a human reader not as a transformation of the p-value, but in the common-sense meaning of the words. By this I mean that after reading that a study presents “strong evidence”, I would understand that, in the absence of other information, I should be relatively convinced that an effect is present (say, something like > 80% probable), while after reading that a study reports “weak evidence”, I would understand something like “more likely than not, but not so sure”. Also, I would understand that

- If two studies have equal evidence (as translated from their p-values), then we should have equal conviction that !H0 is true, in the absence of other information
- If study 1 has lower evidence (as translated from its p-value) than study 2, then we should be less convinced, based on study 1, that !H0 is true than based on study 2, in the absence of other information
- It should also be possible to “add” the evidence in studies 1 and 2 in some consistent way

A sensible operational definition of evidence that satisfies the above-stated properties is the Bayes factor (BF), which gives us the relative probability of the hypotheses given the data, assuming equal priors on the hypotheses. Assuming that other sensible definitions of evidence should provide similar values to the BF would thus require the existence of a UNIVERSAL correspondence between evidence as summarised by the BF and evidence derived from a p-value according to Muff et al., irrespective of the statistical model, sample size, power and the relative complexities of H0 and H1.
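To make the definition concrete, here is a minimal toy example (my own, not from Muff et al.): the BF for a binomial point null against a uniform alternative, together with the resulting posterior odds under equal prior odds.

```r
# Hypothetical data: 15 successes in 20 trials.
# H0: theta = 0.5 (point null); H1: theta ~ Uniform(0, 1).
n <- 20; k <- 15
m0 <- dbinom(k, n, 0.5)                                      # marginal likelihood under H0
m1 <- integrate(function(th) dbinom(k, n, th), 0, 1)$value   # under H1, here = 1/(n+1)
BF10 <- m1 / m0       # evidence for H1 over H0
round(BF10, 2)        # 3.22
BF10 * 1              # posterior odds, assuming equal (1:1) prior odds
```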

The fact that this cannot be true in general can easily be seen just by looking at the formula for the BF, where we see that the relative model complexity of H0 / H1 (including the parameter priors) interacts with the data size to produce the final BF in a way that is unlikely to be universally connected to the p-value. The variance can be somewhat reduced by standardising the parameter priors of H0/H1, but discrepancies remain. This is nicely summarised in the review by Held & Ott (2018), who explicitly advise against translating p-values directly into evidence for that reason. If we look at Fig. 2a of this paper (Fig. 4 of this text), we see that the relationship of the BF to the p-value is sample-size dependent.

Noting that the BF doesn’t map universally on the p-value (Fig. 2a), Held & Ott write:

Assuming an alternative hypothesis H1 has also been specified, the Bayes factor directly quantifies whether the data have increased or decreased the odds of H0. A better approach than categorizing a p-value is thus to transform a p-value to a Bayes factor or a lower bound on a Bayes factor, a so-called minimum Bayes factor (Goodman 1999b). But many such ways have been proposed to calibrate p-values, and there is currently no consensus on how p-values should be transformed to Bayes factors.

Here, they introduce the concept of the minimum BF, defined as “the smallest possible Bayes factor within a prespecified class of prior distributions over alternative hypotheses”. We can see that the minimum BF has a more general relationship with the p-value (Fig. 2b), but judging from the other figures in Held & Ott (2018), this relationship is still not universal across all models (I suspect because the minimum BF is still contingent on the priors and the hypotheses H0/H1). Moreover, seeing that the BF and the minimum BF can be quite different, the question is how the minimum BF is to be interpreted, and whether we really improve the status of statistical reporting by introducing “the smallest possible evidence in favour of H1” instead of “the evidence in favour of H1”. Held & Ott (2018) is cited by Muff et al., so I assume they are aware of the content of the paper, but they apparently interpreted it differently. Anyway, my understanding of the literature is that a universal correspondence between p-value and BF is hard to establish, but that what Muff et al. define is likely closer to a minimum BF than to a conventional BF, which means that it is NOT the evidence in favour of H1 (which would be the conventional BF).
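For readers who want to see what such a calibration looks like: one well-known minimum-BF bound (Sellke, Bayarri & Berger 2001, one of the calibrations reviewed by Held & Ott) can be computed in one line.

```r
# Minimum Bayes factor in favour of H0, valid for p < 1/e: -e * p * log(p).
minBF <- function(p) -exp(1) * p * log(p)
round(minBF(0.05), 2)   # 0.41: p = 0.05 gives odds of at most ~2.5:1 against H0
round(minBF(0.005), 3)  # 0.072: even p = 0.005 gives odds of at most ~14:1
```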

To summarise, I see three immediate problems with the suggestion by Muff et al.:

- Muff et al. provide no operational definition (in terms of probabilities) of what they mean by evidence, other than the p-value itself. If we define evidence = p-value, the definition is circular and redundant.
- Inevitably, however, readers will interpret the word “evidence” not as a p-value, but as an absolute measure of how persuasive the data are in favour of H1 (or at least !H0). In statistics, this concept is usually measured by the BF, but we can mathematically prove that the BF does not map universally onto the definition of Muff et al.; thus we would have two mutually inconsistent quantities for measuring evidence in the literature.
- More importantly, however, the current proposal is likely internally inconsistent as well, in the sense that studies with equal p-values do not always carry the same evidence according to common-sense definitions of the word. For example, the proposed mapping is likely not suitable for comparing evidence across studies with different sample sizes (at least if by “more evidence” we mean “more likely to be true, all other things equal”), or for accumulating evidence in a meta-analysis, as the evidence contained in a p-value differs depending on the sample size of a study.

Based on this, I conclude that interpreting p-values directly as measures of evidence is unlikely to contribute to clarity in statistical reporting. If we want a continuous measure of evidence, a better solution would be to report the BF, which has a clear and established probabilistic definition. We could then think further about approximating the BF based on the p-value (e.g. by the minimum BF), but I do not see much sense in this because it is harder to interpret and, for most statistical end users, it would likely be easier to calculate the BF directly via available packages in R than to obtain the minimum BF.

Being explicit about the fact that evidence = BF would have an additional advantage: it allows us to also calculate posterior odds. Experience shows that many people (particularly those with less statistical training) have a hard time distinguishing likelihood-based from posterior evidence, and will easily interpret a statement such as “we have strong evidence” as high posterior odds for an effect (i.e. as BF * prior odds). Personally, I would therefore recommend reporting evidence (e.g. “the data contain medium evidence for an effect”) together with the posterior odds of H1 (e.g. “together with prior odds from previous studies, we conclude that there is a high probability of an effect”), rather than relying on the (marginal) likelihood alone, as one can simultaneously have high evidence for an effect and a low probability that the effect is present (as humorously noted in this XKCD cartoon).

The last point can easily be understood even from a frequentist viewpoint, when considering the rate at which significant p-values are Type I errors, known as the false discovery rate (FDR). Fig. 3 shows a slide that I usually use in my BSc statistics lecture. What we see is that the probability that a significant result is a false positive depends crucially on a) the rate at which !H0 is true in the experiments, and b) the power of the experiments (1-beta).

When analysing the formula, we must conclude, as has been repeated many times in the literature, that when considering the odds that an effect is there based on the p-value, one must do so in conjunction with the power of the experiment and the prior odds of an effect (see also the popular article by Regina Nuzzo, Nature News (2014)).
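The calculation behind the slide can be sketched in a few lines of R (the function name fdr is mine):

```r
# P(H0 true | significant) = alpha*(1-prior) / (alpha*(1-prior) + power*prior),
# where prior is the proportion of experiments in which !H0 is true.
fdr <- function(prior, alpha = 0.05, power = 0.8) {
  alpha * (1 - prior) / (alpha * (1 - prior) + power * prior)
}
fdr(prior = 0.1)            # 0.36: if true effects are rare, over a third of
                            # significant results are false positives
round(fdr(prior = 0.5), 2)  # 0.06: with likely effects, the FDR is far lower
```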

Held, L., & Ott, M. (2018). On p-values and Bayes factors. *Annual Review of Statistics and Its Application*, *5*, 393-419.

Muff, S., Nilsen, E. B., O’Hara, R. B., & Nater, C. R. (2021). Rewriting results sections in the language of evidence. *Trends in Ecology & Evolution*.

Fisher, R. (1925). *Statistical Methods for Research Workers, First Edition.* Edinburgh: Oliver and Boyd.

Neyman, J. (1950). *Probability and Statistics*. New York, NY: Holt.

Neyman, J., and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. *Philos. Trans. R. Soc. Lond. Ser. A* 231, 289–337.

Nuzzo, R. (2014). Scientific method: statistical errors. *Nature News*, *506*(7487), 150.

Kass, R. E., & Raftery, A. E. (1995). Bayes factors. *Journal of the American Statistical Association*, *90*(430), 773-795.

As a result of this insight, a large number of statistical approaches exist where we try to optimise an objective of the form:

Quality(M) = L(M) – complexityPenalty(M)

where M is the model, L(M) is the likelihood, and complexityPenalty(M) adds some penalty for the model’s complexity. Examples of this structure are information criteria such as the AIC / BIC, shrinkage estimators such as the lasso / ridge (L1 / L2) penalties, or the wiggliness penalty in GAMs.
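The AIC is perhaps the most transparent instance of this structure: up to the conventional factor of -2, it is exactly the log-likelihood minus the number of parameters. A quick check against R’s built-in AIC():

```r
# AIC = -2 * logL + 2 * k = -2 * (logL - k): the complexity penalty is
# simply the parameter count k. Verified on a built-in dataset:
m <- lm(dist ~ speed, data = cars)
k <- length(coef(m)) + 1       # + 1 for the residual variance
aic_manual <- -2 * as.numeric(logLik(m)) + 2 * k
aic_manual - AIC(m)            # 0
```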

When these techniques are introduced in stats classes, they are usually motivated as a means to reduce overfitting, based on the arguments that I gave above. It is well known (though possibly less widely) that many of these penalties can be reinterpreted as a Bayesian prior. For example, shrinkage penalties such as the lasso (L1) or the ridge (L2) are equivalent to a double-exponential or normal prior, respectively, on the regression parameters (see Fig. 1). Likewise, wiggliness penalties in GAMs can be reinterpreted as priors on functional simplicity (see Miller, David L. (2019)).
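The ridge case is easy to verify numerically: the penalised estimate coincides with the posterior mode under a normal prior. A minimal sketch with simulated data (lambda is simply fixed here, not tuned):

```r
# Ridge closed form (X'X + lambda*I)^-1 X'y vs. the minimiser of the
# penalised loss, which is the posterior mode under a normal prior
# on the coefficients.
set.seed(1)
n <- 50
X <- cbind(rnorm(n), rnorm(n))
y <- X %*% c(1, -2) + rnorm(n)
lambda <- 5
beta_ridge <- solve(t(X) %*% X + lambda * diag(2), t(X) %*% y)
penalised <- function(b) sum((y - X %*% b)^2) + lambda * sum(b^2)
beta_map <- optim(c(0, 0), penalised)$par
round(cbind(ridge = c(beta_ridge), map = beta_map), 3)  # the two solutions agree
```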

One may therefore be tempted to re-interpret complexity penalties from statistical learning such as L1/L2 as an a-priori preference for simplicity, similar to Occam’s razor. This, however, misses an important point: in statistical learning, the strength of the penalty is usually estimated from the data. L1/L2 complexity penalties, for example, are usually optimised via cross-validation. Thus, the simplicity preference in these statistical learning methods is not really a priori (which is what you would expect if we had a fundamental / scientific, data-independent preference for simplicity); rather, it is adjusted from the data to optimise the bias-variance trade-off. Note also that, in low-data situations, the penalty may easily favour models that are far simpler than the truth.

This is the reason why classical L1/L2 regularisations are better interpreted as “empirical Bayesian” rather than fully Bayesian. Empirical Bayesian methods are methods that use the Bayesian framework for inference, but with priors that are estimated from the data. Empirical and fully Bayesian perspectives can be switched or mixed, though. One could, for example, add additional data-independent priors on simplicity in a model, and in some sense the common Bayesian practice of using “weakly informative” (data-independent) priors on regression parameters could be interpreted as a light fundamental preference of Bayesians for simplicity.

How does that help us in practice? Well, for example, I am a big fan of shrinkage estimators and would nearly always prefer them over variable selection. The reason why they are rarely used in ecology, however, is that frequentist regression packages that use shrinkage (such as glmnet) don’t calculate p-values. The reason is that obtaining calibrated p-values or CIs with nominal coverage for shrinkage estimators is hard, showing that the latter are probably better understood as statistical learning methods that optimise predictive error than as frequentist methods with controlled error rates. If we re-interpret the shrinkage estimator as a prior in a Bayesian analysis, however, we naturally get normal posterior estimates that can be interpreted straightforwardly for inference. Thus, if you want to apply L1 / L2 penalties in a regression without losing the ability to discuss the statistical evidence for an effect, just do it Bayesian!

References

Miller, David L. (2019) “Bayesian views of generalized additive modelling.” *arXiv preprint arXiv:1902.01330* .

Polson, N. G., & Sokolov, V. (2019). Bayesian regularization: From Tikhonov to horseshoe. *Wiley Interdisciplinary Reviews: Computational Statistics*, *11*(4), e1463.

Park, T., & Casella, G. (2008). The Bayesian lasso. *Journal of the American Statistical Association*, *103*(482), 681-686.

What is the cause, and what are the processes, that gave rise to Earth’s biodiversity patterns through space and time? Much research has been devoted to describing these patterns, and over the years, the fields of macroecology and macroevolution have slowly transitioned from a mainly correlational to a more mechanistic perspective _{(1, 2)}. The challenge with understanding the mechanisms of macroevolution is that, while evolution has in principle simple general rules, it operates across a complex, dynamic world. As a result, there is only so much we can understand with simple theoretical and empirical models – for a more detailed understanding of the diversification of life on Earth, we will require models that reflect not only ecological and evolutionary processes, but also the complexity in the spatio-temporal drivers of the system, in particular changes in climatic and geographic patterns over evolutionary time scales. Such flexible eco-evolutionary models that use realistic dynamic landscapes allow us to compare candidate processes leading to the emergence of biodiversity patterns (such as past and present α, β, and γ diversity, species ranges, ecological traits, and phylogenies) against empirical evidence.

In this post, I share the story of the development of *gen3sis* _{(1)}, an exciting new simulation engine that will hopefully bring us closer to uncovering some of the mysteries behind Earth’s biodiversity. *gen3sis* stands in the tradition of scientists moving from simple mathematical to more complex computational models _{(3)}, and my academic development followed a somewhat similar path. Around 2013 I developed a generalized phylogenetic tree simulator (TreeSimGM), based on multiple probability density functions for speciation and extinction (Bellman-Harris model), together with T. Stadler _{(4)}. We found that age-dependent speciation best explained empirical topologies (tree shape balance) _{(5)}. However, linking such abstract probability functions to real processes is difficult and limited to hypothesis formulation and further speculation. In 2016 I dug deeper into the biological mechanisms underlying biodiversity dynamics by adding more detailed ecological processes to an existing spatially explicit macroevolutionary model (SPLIT) written by P. Descombes, T. Gaboriau, F. Leprieur, L. Pellissier and others _{(6, 7)}. This allowed us to investigate the emergence of global biodiversity patterns.

Informed by my previous experience of generalising a birth-death model in a new context, I wanted to build a more modular and flexible simulation engine, which became *gen3sis*: the *general engine for eco-evolutionary simulations* _{(1)}. The idea was to overcome the limitations of simple models that do not consider explicit spatio-temporal changes, and of spatial models that are built around fixed assumptions and ignore or limit experimentation with ecological and evolutionary processes and their complex interactions. By allowing for custom ecological and evolutionary processes and interactions in an explicit dynamic landscape, we can better predict and understand diversification under changing conditions and expose complex processes at multiple temporal and spatial scales.

During this time, *gen3sis*’ architecture changed multiple times, and its development involved multiple interdisciplinary contributions, including dialogue between software engineers, geologists, modelers, and empiricists. For example, Benjamin Flück, a software engineer, joined the team and helped optimize the code (e.g. porting R to C++) and move selected functions into a customizable configuration file. Specifying the eco-evolutionary rules via a configuration file demanded further thought on the naming of functions and parameters, for proper categorization of mechanisms and intuitive model use. Important to this naming and process definition was the involvement of the Landscape Ecology group and an sDiv synthesis group with participants from multiple backgrounds and specific ecological or evolutionary perspectives. Finding a balance between speed, generality and usability was a long trial-and-error process.

The result is a modelling engine that, for the first time, offers the ability to simulate almost any scenario, promising extraordinary insights into life on Earth at deep-time and large spatial scales. *Gen3sis* keeps track of the differentiation between populations, allowing differentiation to decrease again after secondary contact, while permitting multiple traits that can evolve and interact with biotic and abiotic components, linking ecological and evolutionary processes. Central to the model, and not changeable, are the calculations of clusters of connected populations, which are based on universal principles of gene flow between populations in a spatial context and depend on dispersal abilities. Initial conditions, as well as the other modelled processes, including speciation, dispersal, trait evolution and ecology, are changeable and interconnected in a very customizable and intuitive way.

For example, take speciation, which is essential to understanding the emergence of biodiversity. In most phylogenetic macro-evolutionary simulators, speciation happens according to a probability density function in a space-less fashion. In *gen3sis*, new species result from a set of rules (functions informed by a user-defined configuration file), and speciation happens in allopatry, after populations have been spatially isolated for a certain period of time. This isolation can depend on: (1) species dispersal abilities, which can evolve and trade off with other traits; (2) landscape connectivity, which can consider barriers (e.g. land for aquatic or water for terrestrial organisms) and change over time (e.g. a reconstructed paleolandscape); (3) ecological processes, which can modulate abundances or presences considering abiotic and biotic conditions; as well as (4) evolutionary processes that dictate persistence under changing conditions or adaptation to new settings. Additional mechanisms and feedbacks are possible, such as the inclusion of temperature effects on mutation or metabolic rates. Consequently, model complexity is customizable, allowing us to test whether we can differentiate between models.

*Gen3sis* is more than an eco-evolutionary model developed to answer one specific question. *Gen3sis* is a general engine that allows the formalization and testing of ecological and evolutionary processes happening in complex and dynamic landscapes. *Gen3sis*’ flexibility opens up a wide range of future applications, demonstrated in a case study accompanying the methods publication in PLOS Biology addressing the latitudinal diversity gradient _{(1)}. In another – soon to be published – study, *gen3sis* revealed the importance of palaeoenvironmental dynamics, rather than current climatic factors, in the formation of the uneven distribution of biodiversity across tropical regions. Currently, I am using *gen3sis* to study local processes and to better scale mechanisms across space, time and levels of complexity using regional metacommunity eco-evolutionary experiments.

Other exciting possible future applications could address causal links between biodiversity and: (a) orogenetic and/or erosion models; (b) aquatic ecological and/or evolutionary processes; (c) temperature and/or water availability; (d) climatic variations; (e) intraspecific genetic variability; (f) functional traits such as niche width and dispersal abilities; as well as (g) emerging interaction networks. Practical uses could involve long-term conservation planning, such as wildlife corridors, or modeling the spread of infectious diseases under multiple scenarios (e.g. COVID). Alternatively, and personally very interesting to me, *gen3sis* can contribute to fields that traditionally do not rely on biological principles, such as cultural and technological evolution. For more non-exhaustive expected applications of *gen3sis*, see Table 4 in _{(1)}.

While we are far from predicting the emergence of biodiversity patterns on Earth, *gen3sis* offers an open-source tool able to simulate gradual changes influenced by multiple factors in constant interaction over long periods of time. This has the potential to advance knowledge in multiple, interdisciplinary research areas. *Gen3sis* is available as an R package on CRAN, along with beginners’ tutorials, in order to facilitate use, dialogue and support among other scientists piecing together key puzzles of Earth’s astonishing biodiversity. Available on GitHub under GPL3, *gen3sis* aims to foster open model development within a critical and varied community. For this, you are more than welcome to join!

I thank Florian Hartig, Laura Méndez and Emma Ladouceur for comments and feedback.

- For a short historical perspective, see the monograph introduction (~25 min read)
- Another blog post commenting on gen3sis (~15 min read)
- R package GitHub and CRAN repositories

1. O. Hagen, B. Flück, F. Fopp, J. S. Cabral, F. Hartig, M. Pontarp, T. F. Rangel, L. Pellissier, gen3sis: A general engine for eco-evolutionary simulations of the processes that shape Earth’s biodiversity. *PLOS Biol.* **19**, e3001340 (2021).

2. M. Pontarp, L. Bunnefeld, J. S. Cabral, R. S. Etienne, S. A. Fritz, R. Gillespie, C. H. Graham, O. Hagen, F. Hartig, S. Huang, R. Jansson, O. Maliet, T. Munkemuller, L. Pellissier, T. F. Rangel, D. Storch, T. Wiegand, A. H. Hurlbert, The latitudinal diversity gradient: Novel understanding through mechanistic eco-evolutionary models. *Trends Ecol. Evol.* **34**, 211–223 (2019).

3. M. Weisberg, *Simulation and Similarity: Using Models to Understand the World* (OUP USA, 2013; https://books.google.de/books?id=rDu5e532mIoC), *Oxford Studies in Philosophy of Science*.

4. O. Hagen, T. Stadler, TreeSimGM: Simulating phylogenetic trees under general Bellman-Harris models with lineage-specific shifts of speciation and extinction in R. *Methods Ecol Evol*. **9**, 754–760 (2018).

5. O. Hagen, K. Hartmann, M. Steel, T. Stadler, Age-dependent speciation can explain the shape of empirical phylogenies. *Syst. Biol.* **64**, 432–440 (2015).

6. F. Leprieur, P. Descombes, T. Gaboriau, P. F. Cowman, V. Parravicini, M. Kulbicki, C. J. Melian, C. N. de Santana, C. Heine, D. Mouillot, D. R. Bellwood, L. Pellissier, Plate tectonics drive tropical reef biodiversity dynamics. *Nat. Commun.* **7**, 11461 (2016).

7. P. Descombes, F. Leprieur, C. Albouy, C. Heine, L. Pellissier, Spatial imprints of plate tectonics on extant richness of terrestrial vertebrates. *J. Biogeogr.* **44**, 1185–1197 (2017).

At the time, there was a heavy backlash against the study, and probably rightly so, as the statistical analysis turns out to be highly unstable to changes in the regression formula. You can find some links here. Over the years, however, I have found that this study has at least one virtue: it is an excellent example for teaching students about the importance of selecting the right functional relationship when running an analysis, and about the substantial “dark” uncertainty that can arise from these researcher degrees of freedom.

The reason why the hurricanes make such an excellent pedagogical example is that, as I point out here, the effect of femininity is highly unstable and depends strongly on which predictors you select, presumably because of high collinearity, the considered interaction(s) and the unbalanced femininity / mortality distribution.

In the stats course that I just finished teaching, I gave the students the task to re-analyze the hurricane data, which also led me to run some DHARMa residual checks on the original negative binomial model fitted by Jung et al. Here is the residual analysis of the model with DHARMa, for technical reasons fitted with glmmTMB and not with mgcv (as in the original study). The main DHARMa residual plot shows a somewhat funky pattern, which is, however, not flagged as significant by the tests:

If we plot the residuals against NDAM, however, we get a clear and highly significant misfit. The original model is obviously not acceptable, and the student teams that did the re-analysis practically all spotted this immediately. This serves as a reminder of how efficient systematic residual checks for GLMMs are. In defense of the authors: DHARMa was not available at the time, although this pattern was also visible in standard Pearson residuals, as pointed out by Bob O’Hara at the time.

We also find (light) spatial autocorrelation, but with a negative lag 1. One may speculate that this could arise if people are more careful after a particularly deadly hurricane in the previous year, but it is equally possible that this is a fluke / false positive.

If you want to repeat the residual analysis, here’s the code. The data is conveniently stored by PNAS.

Overdispersion is a common problem in GL(M)Ms with fixed dispersion, such as Poisson or binomial GLMs. Here is an explanation from the DHARMa vignette:

GL(M)Ms often display over/underdispersion, which means that residual variance is larger/smaller than expected under the fitted model. This phenomenon is most common for GLM families with constant (fixed) dispersion, in particular for Poisson and binomial models, but it can also occur in GLM families that adjust the variance (such as the beta or negative binomial) when distribution assumptions are violated.

The main issue with overdispersion is that

- p-values tend to be too small, thus leading to inflated Type I error rates
- CIs will be too small, thus leading to overconfidence about the precision of the estimate
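A quick way to see overdispersion in raw data is to compare the variance to the mean, which the Poisson assumes to be equal (a toy simulation of my own, not from the vignette):

```r
set.seed(42)
y_pois <- rpois(10000, lambda = 5)        # Poisson: Var = mean
y_nb <- rnbinom(10000, mu = 5, size = 1)  # neg. binomial: Var = mu + mu^2/size
var(y_pois) / mean(y_pois)  # ~1
var(y_nb) / mean(y_nb)      # ~6, i.e. strongly overdispersed relative to Poisson
```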

Several R packages, notably DHARMa, allow testing GL(M)Ms for overdispersion. For version 0.3.4, we added a new parametric dispersion test, and we also recently ran a large number of additional comparative analyses of their power in different situations. The gist of this work is: the tests are pretty good at picking up on dispersion problems in a range of models.

But are those tests good enough? Or are they maybe too good, meaning: if you get a significant dispersion test, should you switch to a variable-dispersion GLM (e.g. negative binomial), even if the dispersion problem is mild? For a quick check of this question, I ran a few simulations using the DHARMa::runBenchmarks function.

```
library(DHARMa)

# simulate Poisson data with added overdispersion (control), fit a Poisson
# GLM, and return the p-values of the slope and of the DHARMa dispersion test
returnStatistics <- function(control = 1){
  testData = createData(sampleSize = 200, family = poisson(),
                        overdispersion = control, fixedEffects = 0,
                        randomEffectVariance = 0)
  fittedModel <- glm(observedResponse ~ Environment1, data = testData,
                     family = poisson())
  x = summary(fittedModel)
  res <- simulateResiduals(fittedModel = fittedModel, n = 250)
  out <- c("Type I error GLM slope" = x$coefficients[2,4],
           "DHARMa testDispersion" = testDispersion(res, plot = FALSE)$p.value)
  return(out)
}

out = runBenchmarks(returnStatistics, controlValues = seq(0, 1.5, 0.05), nRep = 500)
plot(out, xlab = "Added dispersion sd", ylab = "Prop significant", main = "n = 200")
```

The idea of these simulations is to slowly increase the overdispersion in a Poisson GLM in which the true slope of a tested effect (Environment1) is zero. As overdispersion increases, we will get an increasing Type I error rate (false positives) on the slope. We can then compare the power of the DHARMa::testDispersion function with the rate of Type I errors, to see if DHARMa would have warned us early enough about the problem, or if the dispersion tests are maybe warning too early, i.e. before calibration issues in the p-values of the GLM get serious (see Fig. 1, calculated for different sample sizes n).

The results suggest that the power of DHARMa overdispersion tests depends more strongly on sample size than the increase of Type I error caused by the overdispersion. More precisely, for small sample sizes (n = 10/40), overdispersion tests pick up a signal roughly at the same point where overdispersion starts to create problems in the GLM estimates (i.e. false positives in regression effects).

For larger sample sizes (in particular n = 5000), however, even small levels of overdispersion are being picked up by DHARMa, whereas the GLM type I error is surprisingly unimpressed by the sample size. I have to say I was a bit surprised about the latter behaviour, and still do not fully understand it. It seems that the increase of type I error in a Poisson GLM mainly depends on the nominal dispersion and not so much on the sample size. Please comment if you have any idea about why this would be the case, I would have expected sample size to play a role as well.

Whatever the reason for the GLM behaviour, my conclusions (disclaimer: this is of course all only for a simple Poisson GLM, one should check if this generalises to other models) are as follows:

- In my simulations, problems with overdispersion were only substantial if a) tests are significant and b) the dispersion parameter is large, say > 2.
- This suggests that one could ignore significant overdispersion tests in DHARMa if n is large AND the estimated dispersion parameter is close to 1 (the idea being that the tests get very sensitive for large n, thus picking up on minimal dispersion problems that are unproblematic for the GLM estimates)
- That being said: given that n is large, it seems to be no problem to support an additional parameter for adjusting the dispersion, e.g. through a negative binomial, so the question is whether there is any reason to make this call

My overall recommendation would still be to move to a variable dispersion glm as soon as DHARMa dispersion tests flag overdispersion. But if you have particular reasons for avoiding this, you could ignore a positive test if n is large and the dispersion is close to 1.

**Edit 25.3:** in response to a question by **Furchtk**, I have run some more simulations varying the intercept of the Poisson (Fig. 2). What we can clearly see is that lower intercepts behave a bit similarly to lower n, which is to be expected, as the integer stochasticity of the Poisson increases towards lower means. I’m not sure that I see anything else happening here, but again, one would probably have to check more systematically. It is a good point, though, that we should see n in relation to the mean as well, i.e. with a Poisson mean of 0.01, n = 20 means a different thing than if the mean is 10.

**Update April 2020:** this paper has now been published in Nature, with a comment by Mark Pagel. From skimming the published version, it seems to me that the text has been a bit condensed, and that the implications were possibly a bit toned down, but I believe that the comments here largely remain valid for the published version as well.

**Update July 2020:** Helene Morlon, Stéphane Robin and I have written a more in-depth analysis of the article, which is available here.

**Update March 2022:** Helene Morlon, Stéphane Robin and I have written a larger opinion piece, “Studying speciation and extinction dynamics from phylogenies: addressing identifiability issues”, published in TREE and available here.

Consider the following analysis task, which is arguably one of the most important in macroevolutionary research:

- We have a time-calibrated phylogeny for all extant species of a clade, but no information about the extinctions that presumably happened during its diversification from the last common ancestor to the present day.
- We want to fit statistical models (so-called birth-death models) to draw inference about speciation rates (birth = b) and extinction rates (death = d), and how those rates changed over time, so we are looking to infer b(t), d(t).

Let’s stay with the assumption of constant birth and death rates b, d for the moment. It may be surprising that it is indeed possible to simultaneously infer b and d from an extant tree. Surprising, because one might think that an increase in d could always be counteracted by increasing b to arrive at the same number of extant species, which would naively render b and d unidentifiable. However, the model with constant birth and death rates is identifiable, although the uncertainty regarding the difference b − d is generally much lower than that of the individual rates, i.e. it is easier to estimate an effective diversification rate b − d than the precise values of b and d (Nee, 2006), a result that we also find in Maliet et al. (2019).

My intuition about this was (up to now) the following: yes, b and d trade off with regard to the final number of species, but combinations of larger b and d that produce the same number of final species will create more variation within the phylogeny, which makes it possible to separate the parameters. Although, after reading the paper, I have to concede that the reason is maybe a different one. Anyway.
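This intuition is easy to probe with a toy simulation (a hedged sketch, not code from any of the papers cited): two linear birth-death processes with the same net diversification b − d = 0.5 but different turnover have the same expected number of extant species, while the high-turnover process is markedly more variable.

```python
import random
from statistics import mean, pvariance

def simulate_bd(b, d, T, rng, n0=1):
    """Forward Gillespie simulation of a linear birth-death process;
    returns the number of extant lineages at time T."""
    n, t = n0, 0.0
    while n > 0:
        t += rng.expovariate(n * (b + d))  # waiting time to the next event
        if t > T:
            break
        n += 1 if rng.random() < b / (b + d) else -1  # birth or death
    return n

rng = random.Random(1)
reps, T = 10000, 1.0
low  = [simulate_bd(1.0, 0.5, T, rng) for _ in range(reps)]  # b - d = 0.5
high = [simulate_bd(2.0, 1.5, T, rng) for _ in range(reps)]  # b - d = 0.5

# Same net diversification, hence the same expected richness (e^{0.5} ~ 1.65) ...
print(round(mean(low), 2), round(mean(high), 2))
# ... but the high-turnover process is noticeably more variable:
print(round(pvariance(low), 2), round(pvariance(high), 2))
```

Analytically, E[N(T)] = e^{(b−d)T} in both cases, whereas Var[N(T)] scales with (b+d)/(b−d), which is exactly the extra “variation within the phylogeny” the intuition above appeals to.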

Macroevolutionary analysis does of course not stop at constant-rate models. A big interest of the field is to understand how diversification rates change over time, e.g. to examine the effect of environmental conditions, key innovations etc. on speciation and extinction rates. A large range of statistical models has been proposed and fitted that allow time or the environment to affect speciation or extinction rates (Condamine et al., 2013), or that allow shifts in diversification rates at some points in time or for some clades (e.g. Rabosky et al., 2014).

Against this background, the main claim of Louca & Pennell is that

- The likelihood of a given diversification model depends only on the lineage-through-time (LTT) plot, i.e. the diversity through time is a sufficient statistic for this type of problem.
- Asymptotically (i.e. for many species), the LTT plot can be modeled by a set of differential equations that describe the temporal change dM/dt of the number of species in the LTT. Analysis of these equations shows that a large family of functions b(t), d(t) can produce the same M(t), i.e. the same LTT plot.
- Thus, if we assume that birth and death rates can be arbitrary functions of time, it is not possible to simultaneously identify b(t) and d(t). Rather, there are multiple diversification histories that will produce the same LTT. Louca & Pennell thus propose to only consider an effective diversification parameter (what they call the pulled speciation rate), which is identifiable, but maps onto multiple, possibly quite different b(t), d(t) combinations.
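For reference, the LTT plot in claim 1 is just a step function over the node times of the reconstructed tree. A minimal sketch (`ltt` is a hypothetical helper; any phylogenetics library provides the equivalent), assuming an ultrametric tree whose node times are measured forward from the root, with the root split included:

```python
def ltt(branching_times):
    """Lineages-through-time curve of a reconstructed ultrametric tree,
    given its internal-node (branching) times measured from the root.
    The root split leaves 2 lineages; every later split adds one."""
    return [(t, i + 2) for i, t in enumerate(sorted(branching_times))]

# A four-tip tree with splits at times 0, 1.2 and 1.5:
print(ltt([1.5, 0.0, 1.2]))  # -> [(0.0, 2), (1.2, 3), (1.5, 4)]
```

M(t) in the claims above is exactly this step function, smoothed in the limit of many species.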

For constant b and d, the pulled speciation rate will not be constant, but is given by a differential equation. Because this is really the central point of the paper, I copy the relevant part of the paper in full.

Analysis of these equations reveals that very different b(t), d(t) models can have identical pulled speciation / diversification rates and thus produce identical LTTs.

Louca & Pennell also address the question of why no one has noticed this before (I’m sure they got the same question from the reviewers). They argue that the most common models that are fitted to data specify functions for b(t), d(t) that will only intersect with one value of the pulled speciation rate. A visualisation of this idea is provided below:

A first, somewhat tangential comment is about claim 1, which defines the scope of the paper: Louca & Pennell consider models for which speciation and extinction rates are functions of time, or of some other variable that acts uniformly across the phylogeny. It is true that this is the assumption of many models, but there are other important models where b and d rates differ between lineages / subclades rather than over time. My feeling was that if the arguments in Louca & Pennell have merit (more on that below), it should be possible to generalize them to such settings as well (such as those we consider in Maliet et al., 2019). In any case, the point that the likelihood depends only on the LTT plot (or M(t)) seems to me more like a simplifying assumption than a result.

The main question, however, is clearly about points 2 and 3 above (i.e. the fact that it is not possible to distinguish between quite different diversification histories), which obviously have profound implications for macroevolutionary analysis. I would like to approach this claim from two sides:

- Is the proof that leads to the claim that models with identical pulled speciation rate have the same likelihood correct?
- If so, how much does that matter, given that most current models seem to be identifiable?

I’ll be brief here: I don’t know. The proof looks overall convincing to me, except for one concern, which is that Louca & Pennell first consider asymptotics (to express M(t) as a smooth differentiable function), and then derive the likelihood based on this smooth M(t). This feels a bit like interchanging limits in a mathematical series. In other words: we first make the LTT smooth by taking the limit n -> infinity, and then calculate the likelihood on this smooth LTT, whereas strictly speaking, we should first calculate the likelihood, and then take the limit in n. My concern is that there might be local variation in the LTT that contains information for the inference, but that is now hidden by the fact that we take the limit of tree size to infinity first. I had always thought (see my comments above) that the differences in stochasticity of different b, d combinations are at least in part responsible for their identifiability. It seems to me that Louca & Pennell suggest that this is indeed not so, and that the different shape of the LTT is the only reason for identifiability.

However, I’m not sure that this is in fact an issue, and the idea of taking this limit goes back to at least Morlon et al. (2011). Still, I guess I’d simply like to convince myself with a very thorough, large set of replicate simulations that there is no information hidden in the stochasticity. Maybe someone with more insight into this has thoughts / comments?

Let’s say the proof holds. The question then is how much this matters for the field and the existing methods. Louca & Pennell concede that most models that are currently fitted are identifiable. I would argue that this shows that the situation is maybe not as bleak as they suggest, in the sense that what most people have so far found worth testing is testable.

This is especially true when we consider that the non-identifiability of arbitrary b(t), d(t) functions is not surprising, just from counting degrees of freedom. If we have M branching events, we can never fit a model with a change in b(t), d(t) at each branching event. Such a model would be desperately over-parameterized. Just algebraically, we can only hope to fit a model with a change in b(t), d(t) at every second branching event. If we add stochasticity to the equation, e.g. the rule of thumb that you typically need 10x the data to constrain 1 df, we arrive at a requirement of around 20 branching events for every degree of freedom in the functions specifying b(t), d(t) (with strong trade-offs between b and d in the likelihood, possibly more). These back-of-the-envelope calculations suggest to me that, for a 100-species phylogeny, we could anyway only hope to fit 1-3 parameters for b(t) and d(t) each. This is probably not complex enough to produce the shapes in Fig. 1.
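The back-of-the-envelope count can be written down explicitly (`max_free_params` is a hypothetical helper, hard-coding the 20-events-per-df rule of thumb from above):

```python
def max_free_params(n_species, events_per_df=20):
    """Rule-of-thumb upper bound on the number of free parameters in
    b(t), d(t) that an extant phylogeny of n_species can support:
    n_species - 1 branching events, ~20 events per degree of freedom."""
    return (n_species - 1) // events_per_df

# A 100-species tree supports ~4 df in total, to be split between
# b(t) and d(t), i.e. roughly 1-3 parameters each:
print(max_free_params(100))   # -> 4
print(max_free_params(1000))  # -> 49
```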

So, effectively, if all the things that we could have hoped for anyway are doable, is it really so important that we can’t distinguish between a large number of crazy scenarios? Don’t get me wrong: if the proof of Louca & Pennell is right, I think it’s a useful point, and the differences between models that (supposedly) lead to the same likelihood are quite impressive. I’m just wondering whether it makes any difference for the typical analyses in the field, which are run on small clades where model complexity is limited by the data anyway. And for large clades, there are many other assumptions that are probably violated by the b(t), d(t) model, including that diversification rates are homogeneous across subclades and independent between lineages.

A final point: Louca & Pennell suggest that inference should essentially concentrate on the pulled speciation and diversification rates, which define the “congruent sets” of diversification scenarios that are compatible with an LTT. I didn’t get what we gain by that. In the end, this is just a re-transformation of the LTT plot that is hard to interpret. The insight that a problem is overparameterized or unidentifiable would suggest to me that we have to think harder about how to make it fittable, e.g. by reducing the number of parameters in the model, or by adding regularization on the parameters (as we did, e.g., in Maliet et al., 2019, where we fit a change in diversification rate at each time step with a regularization that assumes that diversification rates tend to stay similar between time steps). So, if the results hold, what I would take from the paper is that macroevolutionary research has to think hard (possibly harder than before) either about specific candidate mechanisms and hypotheses, whose predictions can then be contrasted with the data, or about statistical priors, regularisations or null assumptions that make the problem identifiable.

Louca, S., Pennell, M.W., 2019. Phylogenies of extant species are consistent with an infinite array of diversification histories. bioRxiv 719435. https://doi.org/10.1101/719435

Maliet, O., Hartig, F., Morlon, H., 2019. A model with many small shifts for estimating species-specific diversification rates. Nat. Ecol. Evol. 3, 1086–1092. https://doi.org/10.1038/s41559-019-0908-0

Pontarp, M., Bunnefeld, L., Cabral, J.S., Etienne, R.S., Fritz, S.A., Gillespie, R., Graham, C.H., Hagen, O., Hartig, F., Huang, S., Jansson, R., Maliet, O., Münkemüller, T., Pellissier, L., Rangel, T.F., Storch, D., Wiegand, T., Hurlbert, A.H., 2018. The Latitudinal Diversity Gradient: Novel Understanding through Mechanistic Eco-evolutionary Models. Trends Ecol. Evol.

Nee, S., 2006. Birth-Death Models in Macroevolution. Annu. Rev. Ecol. Evol. Syst. 37, 1–17. https://doi.org/10.1146/annurev.ecolsys.37.091305.110035

Condamine, F.L., Rolland, J., Morlon, H., 2013. Macroevolutionary perspectives to environmental change. Ecol. Lett. 16, 72–85.

Rabosky, D.L., et al., 2014. BAMMtools: an R package for the analysis of evolutionary dynamics on phylogenetic trees. Methods Ecol. Evol. 5, 701–707.

Morlon, H., Parsons, T.L., Plotkin, J.B., 2011. Reconciling molecular phylogenies with the fossil record. Proc. Natl. Acad. Sci. 108, 16327–16332.

BioScience has just published the latest installment of “Scientists’ Warnings”. There have been two previous such Warnings, the latest organised by the same authors in 2017. Quite a few scientists have signed this Warning. I chose not to, although I had signed the previous one in 2017.

I have been heckled by a colleague from the Economics department about why I don’t rush to present and justify my research activities to the public. Taking part in societal deliberations about climate change is one such form of “outreach”. He implied that it is “irresponsible” to work in an ivory tower (although it may actually be more of an ivory basement). Reading the repeated Scientists’ Warning, I got a better feeling for why I disagree, and why I didn’t sign this time round. And it has nothing to do with whether I agree, as a private person, with the statement (for the record: I do).

In Neal Stephenson’s book Anathem, scientists are separated from the rest of the world and live in cloisters. They work in different castes, if you like, differentiated by how often they have contact with the outer, secular world: every year, every ten years, every century and every millennium. In between, they receive no mail, no books, no information from the world outside (apart from hearing planes flying overhead and from discussions with their brothers and sisters in the other castes, who are forbidden to touch any topical and current subject in these conversations). As a result, the “Unarians” discuss and work on issues of a near-term, almost immediate nature, while at the other extreme the Millenarians take the long view.

The explosion of the human population and the resulting devastation that we humans inflict on ourselves and the planet (climate change, deforestation, desertification, water pollution, you name it) poses a challenge to those ecologists who sympathise with a 100- or 1000-year view of their research. I sometimes half-jokingly refer to science as that bit of research that is still true in 500 years. That over 11,000 scientists signed the last Warning is perceived as a very strong statement by the “general public” (or so my non-scientist friends tell me). The strength comes from the fact that scientists, by and large, are perceived as impartial, rational and as taking the long view.

I decided not to sign the latest Scientists’ Warning because my long (or at least mid-term) view is currently extremely clouded. The cacophony of current affairs, media outbursts, and the scientific and funding rush to climate change and (loss of) biodiversity pushes past reflection, arguments and understanding. I perceive an increasing proportion of the work in my field to be tainted by advocacy and short-termism. I cringe at oversimplified podium statements by do-gooders of my own discipline, at newspaper interviews and podcasts, for example describing the building of dams as “destroying biodiversity” because several hectares of riparian forest are lost. (Here the term “biodiversity” is used as a synonym for “nature” or “wild stuff”, not in any of its already too vague actual meanings.) Asked something “simple”, such as “How can we decrease chemical contamination of our environment?”, stuff that we teach in the Bachelor programme, I drop my gaze and stare at my shoes: this is not the right question; this is about moral judgement, about societal values, about political attitude. But these are “short-term views”, and, by my above definition, not science. (The scientific answer is obvious, even to the layperson asking.)

So, for the time being, as a scientist I pull out of street marches, petitions, twitter tirades (well, that was easy) and public calls for “them” to do “something” against climate change and insect decline. My (private, but science-infused) longer-term view identifies overpopulation, slack social norms and socially encouraged egotism (“Get rich or die trying”) as the underlying problems. As a scientist, I am not qualified to comment on this.
