**Update April 2020:** this paper has now been published in Nature, with a comment by Mark Pagel. From skimming the published version, it seems to me that the text has been a bit condensed, and that the implications were possibly a bit toned down, but I believe that the comments here largely remain valid for the published version as well.

Consider the following analysis task, which is arguably one of the most important in macroevolutionary research:

- We have a time-calibrated phylogeny for all extant species of a clade, but no information about the extinctions that presumably happened during its diversification from the last common ancestor to the present day.
- We want to fit statistical models (so-called birth-death models) to draw inference about speciation rates (birth = b) and extinction rates (death = d), and how those rates changed over time, so we are looking to infer b(t), d(t).

Let’s stay with the assumption of constant birth / death rates b, d for the moment. It may be surprising that it is indeed possible to simultaneously infer b and d from an extant tree. Surprising, because one might think that an increase in d could always be counteracted by increasing b to arrive at the same number of extant species, which would naively render b and d unidentifiable. However, the model with constant birth-death rates is identifiable, although the uncertainty regarding the difference b–d is generally much lower than that of the ratio d/b, i.e. it is easier to estimate an effective diversification rate b–d than the precise value of d/b (Nee, 2006), a result that we also find in Maliet et al. (2019).

My intuition about this was (up to now) the following: yes, b and d trade off with regard to the final number of species, but combinations of larger b and d that produce the same number of final species will create more variation within the phylogeny, which makes it possible to separate the parameters. After reading the paper, though, I have to concede that the reason may be a different one. Anyway.
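To make that intuition concrete, here is a minimal sketch using the textbook mean and variance of a constant-rate birth-death process. The function name and the two parameter sets are hypothetical, chosen only for illustration. Note that these moments describe the full process (including lineages that go extinct), not the reconstructed tree of extant species, so the sketch illustrates the trade-off but does not settle the identifiability question.

```python
import math

def birth_death_moments(b, d, t, n0=1):
    """Mean and variance of the number of lineages N(t) in a constant-rate
    birth-death process started from n0 lineages (requires b != d)."""
    r = b - d                          # net diversification rate
    growth = math.exp(r * t)
    mean = n0 * growth
    var = n0 * (b + d) / r * growth * (growth - 1.0)
    return mean, var

# Two hypothetical parameter sets with the same net rate b - d = 0.1
low_turnover = birth_death_moments(b=0.2, d=0.1, t=20.0)
high_turnover = birth_death_moments(b=1.0, d=0.9, t=20.0)

print("low turnover : mean=%.2f, variance=%.1f" % low_turnover)
print("high turnover: mean=%.2f, variance=%.1f" % high_turnover)
```

Both settings give the same expected number of lineages, but the variance scales with (b + d) / (b − d), so the high-turnover setting is much noisier.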

Macroevolutionary analysis does of course not stop at constant b/d models. A big interest of the field is to understand how diversification rates change over time, e.g. to examine the effect of environmental conditions, key innovations etc. on speciation and extinction rates. A large range of statistical models have been proposed and fit that allow time or environment to affect speciation or extinction rates (Condamine et al., 2013), or that allow shifts in diversification rates at some points in time or for some clades (e.g. Rabosky et al., 2014).

Against this background, the main claims of Louca & Pennell are that

- The likelihood of a given diversification model depends only on the lineage-through-time (LTT) plot (i.e. the diversity through time is a sufficient statistic for this type of problem).
- Asymptotically (i.e. for many species), the LTT plot can be modeled by a set of differential equations, which describe the temporal change dM/dt of the number of species M in the LTT. Analysis of these equations shows that a large family of functions b(t), d(t) can produce the same M(t), i.e. the same LTT plot.
- Thus, if we assume that birth and death rates can be arbitrary functions of time, it’s not possible to simultaneously identify b(t), d(t). Rather, there are multiple diversification histories that will produce the same LTT. Louca & Pennell thus propose to only consider an effective diversification parameter (what they call the pulled speciation rate), which is identifiable, but will map onto multiple, possibly quite different b(t), d(t) combinations.

For constant b and d, the pulled diversification rate will not be constant, but given by a differential equation. Because this is really the central point of the paper, I copy the part of the paper in full.

Analysis of these equations reveals that very different b(t), d(t) models can have identical pulled speciation / diversification rates and thus produce identical LTTs.

Louca & Pennell also address the question of why no one has noticed this before (I’m sure they got the same question from the reviewers). They argue that most common models that are fit to data specify functions for b(t), d(t) that will only intersect with one value of the pulled speciation rate. A visualisation of this idea is provided below:

A first, somewhat tangential comment is about claim 1, which defines the scope of the paper: Louca & Pennell consider models for which diversification and extinction rates are functions of time, or some other variable that acts uniformly across the phylogeny. It is true that this is the assumption of many models, but there are other important models where b/d rates differ between lineages / subclades rather than over time. My feeling was that if the arguments in Louca & Pennell have merit (more on that below), it should be possible to generalize them to more general conditions as well (such as those we consider in Maliet et al., 2019). In any case, the point that the likelihood depends only on the LTT plot (or M(t)) seems to me more like a simplifying assumption than a result.

The main question, however, is clearly about points 2 and 3 above (i.e. the fact that it’s not possible to distinguish between quite different diversification histories), which obviously have profound implications for macroevolutionary analysis. I would like to approach this claim from two sides:

- Is the proof that leads to the claim that models with identical pulled speciation rate have the same likelihood correct?
- If so, how much does that matter, given that most current models seem to be identifiable?

I’ll be brief here: I don’t know. The proof looks overall convincing to me, except for one concern, which is that Louca & Pennell first consider asymptotics (to express M(t) as a smooth differentiable function), and then derive the likelihood based on this smooth M(t). This feels a bit like switching limits in a mathematical series. Or in other words: we first make the LTT smooth by taking the limit n → ∞, and then calculate the likelihood on this smooth LTT, whereas strictly speaking, we should first calculate the likelihood, and then take the limit for n. My concern is that there might be local variation in the LTT that contains information for the inference, but that is now hidden by the fact that we take the limit of tree size to infinity first. I had always thought (see my comments above) that the differences in stochasticity of different b, d combinations are at least in part responsible for their identifiability. It seems to me that Louca & Pennell suggest that this is indeed not so, and that the different shape of the LTT is the only reason for identifiability.

However, I’m not sure that this is in fact an issue, and the idea of taking this limit goes back to at least Morlon et al., 2011. Still, I guess I’d simply like to convince myself, with a very thorough set of simulations with many replicates, that there is no information hidden in the stochasticity. Maybe someone with more insight on this has thoughts / comments?

Let’s say the proof holds. The question then is how much this matters for the field and the existing methods. Louca & Pennell concede that most models that are currently fit are identifiable. I would argue that this shows that the situation is maybe not as bleak as they suggest, in the sense that what most people have so far found worth testing is testable.

This is especially so when we consider that the non-identifiability of arbitrary b(t), d(t) functions is not surprising, just from counting degrees of freedom. If we have M branching events, we can never fit a model with a change in b(t), d(t) at each branching event. Such a model would be desperately over-parameterized. Just algebraically, we can only hope to fit a model with a change in b(t), d(t) at every second branching event. If we add stochasticity to the equation, e.g. the rule that you typically need 10x the data to constrain 1 df, we would arrive at a rule of thumb of requiring around 20 branching events for every degree of freedom in the functions specifying b(t), d(t) (with strong trade-offs between b and d in the likelihood, possibly more). These back-of-the-envelope calculations suggest to me that, for a 100-species phylogeny, we could anyway only hope to fit 1-3 parameters each for b(t) and d(t). This is probably not complex enough to produce the shapes in Fig. 1.
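The back-of-the-envelope calculation above can be written down in a few lines. This is only a sketch of the rule of thumb as stated in the text; the function name and the default of 20 branching events per degree of freedom are assumptions taken from the paragraph above, not an established method.

```python
def fittable_parameters(n_species, events_per_df=20, n_rate_functions=2):
    """Rule-of-thumb count of parameters a tree of extant species can support."""
    branching_events = n_species - 1            # a binary tree with n tips has n - 1 splits
    total_df = branching_events // events_per_df
    per_function = total_df / n_rate_functions  # shared between b(t) and d(t)
    return branching_events, total_df, per_function

print(fittable_parameters(100))  # (99, 4, 2.0)
```

For 100 species this gives roughly four degrees of freedom in total, i.e. about two per rate function, in line with the 1-3 quoted above.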

So, effectively, if all the things that we could have hoped for anyway are doable, is it really so important that we can’t distinguish between a large number of crazy scenarios? Don’t get me wrong, if the proof of Louca & Pennell is right, I think it’s a useful point, and the differences between models that (supposedly) lead to the same likelihood are quite impressive. I’m just wondering if there is any difference for the typical analyses in the field, which are run on small clades where model complexity is limited by data anyway. And for large clades, there are many other assumptions that are probably violated by the b(t), d(t) model, including that diversification rates are homogeneous across subclades, and independent between lineages.

A final point: Louca & Pennell suggest that inference should concentrate essentially on the pulled speciation and diversification rates, which define the “congruent sets” of diversification scenarios that are compatible with an LTT. I didn’t get what we gain by that. In the end, this is just a re-transformation of the LTT plot that is hard to interpret. The insight that a problem is overparameterized or unidentifiable would suggest to me that we have to think harder about how to make it fittable, e.g. by reducing the number of parameters in the model, or by adding regularization on the parameters (as we did, e.g., in Maliet et al., 2019, where we fit a change in diversification rate at each time step with a regularization that assumes that diversification rates tend to stay similar between time steps). So, if the results hold, what I would take from the paper is that macroecology has to think hard (possibly harder than before) either about specific candidate mechanisms and hypotheses, whose predictions can then be contrasted with the data, or about statistical priors, regularisation or null assumptions that make the problem identifiable.

Louca, S., Pennell, M.W., 2019. Phylogenies of extant species are consistent with an infinite array of diversification histories. bioRxiv 719435. https://doi.org/10.1101/719435

Maliet, O., Hartig, F., Morlon, H., 2019. A model with many small shifts for estimating species-specific diversification rates. Nat. Ecol. Evol. 3, 1086–1092. https://doi.org/10.1038/s41559-019-0908-0

Pontarp, M., Bunnefeld, L., Cabral, J.S., Etienne, R.S., Fritz, S.A., Gillespie, R., Graham, C.H., Hagen, O., Hartig, F., Huang, S., Jansson, R., Maliet, O., Münkemüller, T., Pellissier, L., Rangel, T.F., Storch, D., Wiegand, T., Hurlbert, A.H., 2018. The Latitudinal Diversity Gradient: Novel Understanding through Mechanistic Eco-evolutionary Models. Trends Ecol. Evol.

Nee, S., 2006. Birth-Death Models in Macroevolution. Annu. Rev. Ecol. Evol. Syst. 37, 1–17. https://doi.org/10.1146/annurev.ecolsys.37.091305.110035

Condamine, F.L., Rolland, J., Morlon, H., 2013. Macroevolutionary perspectives to environmental change. Ecol. Lett. 16, 72–85.

Rabosky, D.L., et al., 2014. BAMMtools: an R package for the analysis of evolutionary dynamics on phylogenetic trees. Methods Ecol. Evol. 5, 701–707.

Morlon, H., Parsons, T.L., Plotkin, J.B., 2011. Reconciling molecular phylogenies with the fossil record. Proc. Natl. Acad. Sci. 108, 16327–16332.


BioScience has just published the latest installment of “Scientists’ Warnings”. There have been two previous such Warnings, the latest organised by the same authors in 2017. Quite a few scientists have signed this Warning. I chose not to, although I had signed the previous one in 2017.

I have been heckled by a colleague from the Economics department about why I don’t rush to present and justify my research activities to the public. Partaking in societal deliberations about climate change is one such “outreach”. He implied that it is “irresponsible” to work in an ivory tower (although it may actually be more an ivory basement). Reading the repeated Scientists’ Warning, I got a better feeling for why I disagree, and why I didn’t sign this time round. And it has nothing to do with whether I agree, as a private person, with the statement (for the record: I do).

In Neal Stephenson’s book Anathem, scientists are separated from the rest of the world and live in cloisters. They work in different castes, if you like, differentiated by how often they have contact with the outer, secular world: every year, every ten years, every century and every millennium. In between, they receive no mail, no books, no information from the world outside (apart from hearing planes flying over and from discussions with their brothers and sisters in the other castes, who are forbidden to touch on any topical or current subject in these conversations). As a result, the “Unarians” discuss and work on issues of near-term, almost immediate nature, while at the other extreme the Millenarians take the long view.

The explosion of human population size and the resulting devastation that we humans inflict on ourselves and the planet (climate change, deforestation, desertification, water pollution, you name it) poses a challenge to those ecologists who sympathise with a 100- or 1000-year view of their research. I sometimes half-jokingly refer to science as that bit of research that is still true in 500 years. That over 11,000 scientists signed the last Warning is perceived as a very strong statement by the “general public” (or so my non-scientific friends tell me). The strength comes from the fact that scientists, by and large, are perceived as impartial, rational and as taking the long view.

I decided not to sign the latest Scientists’ Warning, because my long (or at least mid-term) view is currently extremely clouded. The cacophony of current affairs, media outbursts, and the scientific and funding rush to Climate Change and (loss of) Biodiversity pushes past reflection, arguments and understanding. I perceive an increasing proportion of the work in my field to be tainted by advocacy and short-termism. I cringe at oversimplified podium statements of do-gooders of my own discipline, at newspaper interviews and podcasts, for example describing building dams as “destroying biodiversity” because several hectares of riparian forest are lost. (Here the term “biodiversity” is used as synonymous with “nature” or “wild stuff”, not in any of its already too vague actual meanings.) Asked something “simple”, such as “How can we decrease chemical contamination of our environment?”, stuff that we teach in the Bachelor programme, I drop my gaze and stare at my shoes: this is not the right question; this is about moral judgement, about societal values, about political attitude. But these are “short-term views”, and, in my above definition, not science. (The scientific answer is obvious, even to the layperson asking.)

So, for the time being, as a scientist I pull out of street marches, petitions, Twitter tirades (well, that was easy) and public calls for “them” to do “something” against climate change and insect decline. My (private, but science-infused) longer-term view identifies overpopulation, slack in social norms and socially encouraged egotism (“Get rich or die trying”) as underlying problems. As a scientist, I am not qualified to comment on this.

One example of this, I find, is the **validity concepts** taught in the social sciences and economics (see Wikipedia). In short, these categorize “failure modes” of inference (e.g. construct validity, internal validity, external validity). For sure, ecologists are aware of these problems as well, but in ecology, they are not typically taught as a concise list / framework in the standard curriculum, even though I have found such a framework to be immensely helpful for students.

Another example is causal inference, and specifically the concept of **mediators, confounders and colliders.** This goes back at least to Pearl 2000 (see also Pearl 2009a,b), and with the popularity of SEMs in ecology, I’m sure that people have at least heard about causal inference in general. However, when reading **the really excellent and highly recommended paper Lederer et al., 2019** *“Control of confounding and reporting of results in causal inference studies. Guidance for authors from editors of respiratory, sleep, and critical care journals.”* in our group seminar, I got the distinct feeling that the practical interpretation of these ideas differs quite strongly between medical and ecological fields.

Lederer et al. first nicely establish an **operational concept of causality** that I would broadly agree with also for ecology: assume we look at the effect of a target variable (something that could be manipulated = predictor) on another variable (the outcome = response) in the presence of other (non-target) variables. The goal of a causal analysis is to control for these other variables, in such a way that we estimate the same effect size that we would obtain if only the target predictor was manipulated (as in an RCT).

You probably have learned in your intro stats class that, to do so, we have to control for confounders. I am less sure, however, if everyone is clear about what a confounder is. In particular, **confounding is more specific than having a variable that correlates with predictor and response**. The direction is crucial to identify true confounders. For example, Fig. 1 C from the Lederer paper shows a collider, i.e. a variable that is influenced by predictor and response. Although it correlates with predictor and response, correcting for it (or including it) in a multiple regression will create a **collider bias** on the causal link we are interested in (corollary: **including all variables is not always a good thing**). The bottom line of this discussion (and the essence of Pearl 2000, 2009) is that to establish causality for a specific link, we have to close the so-called back-door paths for this link, by

- Controlling for confounders (back-doors, blue paths in the figure)
- Not controlling for colliders, M-Bias, and other similar relationships (red paths)
- It depends on the question whether we should control for mediators (yellow paths)
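The collider point in particular is easy to check with a small simulation. The following is a minimal sketch with made-up numbers (a true effect of 0.5, standard normal noise, and a collider defined as predictor + response + noise); the helper functions are hypothetical and only implement ordinary least squares for one and two predictors.

```python
import random

def _centered(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]

def _dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def slope_simple(y, x):
    """Least-squares slope of y on x (model with intercept)."""
    yc, xc = _centered(y), _centered(x)
    return _dot(xc, yc) / _dot(xc, xc)

def slopes_two(y, x1, x2):
    """Least-squares slopes of y on x1 and x2 (model with intercept)."""
    yc, x1c, x2c = _centered(y), _centered(x1), _centered(x2)
    s11, s22, s12 = _dot(x1c, x1c), _dot(x2c, x2c), _dot(x1c, x2c)
    s1y, s2y = _dot(x1c, yc), _dot(x2c, yc)
    det = s11 * s22 - s12 * s12
    return (s1y * s22 - s2y * s12) / det, (s2y * s11 - s1y * s12) / det

rng = random.Random(1)   # fixed seed, so the run is repeatable
n = 20000
true_effect = 0.5        # assumed causal effect of predictor on response

x = [rng.gauss(0, 1) for _ in range(n)]
y = [true_effect * xi + rng.gauss(0, 1) for xi in x]
# The collider is caused by both predictor and response
collider = [xi + yi + rng.gauss(0, 1) for xi, yi in zip(x, y)]

unadjusted = slope_simple(y, x)
adjusted, _ = slopes_two(y, x, collider)

print("effect without the collider: %.2f" % unadjusted)
print("effect with the collider   : %.2f" % adjusted)
```

Under these assumptions the regression without the collider recovers the true effect of about 0.5, while including the collider pushes the estimate towards (0.5 − 1) / 2 = −0.25, i.e. it even flips the sign.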

My impression is that these types of arguments are well-established in the medical and economic literature (in the sense that people regularly use them to defend inclusion / exclusion of variables in a regression), but that they are rarely invoked in the ecological literature.

Moreover, what I really liked about the Lederer paper is their discussion of the **Table 2 fallacy.** The paper recommends that variables included as confounders should NOT be discussed and not be presented in the regression table at all (this is typically Table 2 in a paper, thus the name), because they are themselves usually not corrected for confounding (and they shouldn’t or at least don’t have to be corrected for, see Pearl 2000 / discussion above). Sensible advice, but I think contrary to common practice in standard and SEM regression reporting in ecology.

A cynical (but possibly accurate) explanation for why the Table 2 fallacy is the norm in ecology is that we rarely have a clear target variable / hypothesis, and thus we feel all variables that were used have to be discussed. A side effect is that this makes for the most boring result / discussion sections, where the effect of one variable after the other has to be discussed and interpreted. More importantly, however, each variable that is discussed as a causal effect must be controlled for confounding, or else we should make a clear distinction between the variables that are controlled, and those that aren’t. As I said, Lederer et al. recommend not mentioning uncontrolled variables at all. I’m not sure if that is practical for ecology (as analyses are often semi-explorative), but I have recently been wondering about the option to separate reasonably controlled from possibly confounded variables by a bar or extra section in the regression table.

My only small quibble with the otherwise excellent Lederer paper relates to their comments about significance. First, I strongly support their call for concentrating on parameters and CIs instead of p-values. However, I find their recommendation to avoid the term “not significant” in favor of a vague term such as “the estimate is imprecise” a bad one (this is, by the way, similar to some other recent papers, e.g. Dushoff et al., 2019, Amrhein et al., 2019, which would make a nice topic for another post). The idea behind this recommendation is that researchers tend to misinterpret n.s. as “no effect”, but it seems to me the response should be to better educate researchers about what n.s. means, not to muddy the waters by hiding the fact that a test was done.

Lederer, D. J., Bell, S. C., Branson, R. D., Chalmers, J. D., Marshall, R., Maslove, D. M., … & Stewart, P. W. (2019) Control of confounding and reporting of results in causal inference studies. Guidance for authors from editors of respiratory, sleep, and critical care journals. *Annals of the American Thoracic Society*, *16*(1), 22-28.

Pearl, J. (2009) Causal inference in statistics: An overview. *Statistics surveys* 3, 96-146.

Pearl, J. (2000 / 2009) *Causality*. Cambridge University Press, 1st / 2nd ed.

Dushoff, J., Kain, M.P. and Bolker, B.M., 2019. I can see clearly now: reinterpreting statistical significance. *Methods in Ecology and Evolution*.

Amrhein, V., Greenland, S. and McShane, B., 2019. Scientists rise up against statistical significance.


The relationship between species richness and ecosystem function is a field of ecology that has always puzzled me. I learned the scientific ropes in a department of vegetation ecologists: vegetation was the result of environmental conditions, and indeed a substantial part of their research was to quantify what a plant species indicates about its environment (think Ellenberg indicator values). While of course species may be absent from a community due to competition, those species that *are* there reflect climate, soil, management.

Thus, when I see a paper showing a **strong** effect of species richness, I feel that there must be something amiss. (This paranoid and blanket scepticism goes far beyond “biodiversity” effects.) Can it really be true that in a give-or-take “natural” system we can boost productivity by 100-200% by having more species? Looking out of my office window, I can make out the Black Forest, and a nice large monoculture of spruce. Will adding *a random local tree species* increase the productivity? And does a mixture of, say, beech and spruce with a higher productivity demonstrate an effect of tree species richness (TSR) on productivity (P)?

Actually, this blog post is an appetizer for our re-analysis of Liang et al. (2016, Science). But bear with me for another brief excursion. Let me first repeat an argument I read in Donald Maier’s scathing critique of “biodiversity research” (Maier 2012: “What’s So Good About Biodiversity?”, Springer): When we plot species richness on the x-axis, we assume that the species we count are equivalent. If they weren’t, their number would not be helpful, and we should quantify something else, e.g. a trait or their abundance or their composition; but not their *number*. And, when investigating the effect of TSR, the x-axis implies random species composition. If it wasn’t random, then richness would be confounded with something else. (Admittedly Maier put it better, but also more verbosely.)

Liang et al. (2016, Science: “Positive biodiversity-productivity relationship predominant in global forests”) present such a figure, with an increase in productivity from around 3.5 to well over 10 m³ ha⁻¹ yr⁻¹, as “relative species richness” increases from little to 100% on the x-axis. Such a figure rings my alarm bells. So, together with two BSc students, we re-analysed the data presented in that paper.

There are various points that we consider problematic (be it extremely unrealistic values for P; Euclidean distances between plots on a spherical world; non-stratified sampling of biomes; computation of “bootstrapped” error bars), and we investigated them one by one, but the pivotal point is the x-axis: What does “relative species richness” mean? Quite simply, it is the number of tree species in a plot divided by 270, the highest species richness in the data set considered. (Now that is a tiny bit unfair, but it is essentially what it is. In the rundown of the re-analysis we of course use Liang et al.’s definition.) So, a 10-species plot in Finland receives a value of 3%, while a plot in Panama gets a value of 100%. Can you spot the problem? Yes: the TSR gradient is in fact a latitudinal gradient. That, in turn, means that the plot does not depict the effect of TSR on P, but of latitude on P!

We were still charmed by the idea of constructing an x-axis that is relative. Instead of “relative to the highest richness in the tropics”, however, we constructed a tree-species richness relative to the highest number of tree species observed *in that region*. So 100% means “as many as you can get around here”, and varies between 5 tree species in Siberia and 500 in Panama.
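The difference between the two x-axes is easiest to see on a toy example. The plots and richness values below are entirely hypothetical (they are not taken from the data set), and the sketch uses the simplified “divide by the maximum” definition described above rather than Liang et al.’s exact definition.

```python
# Hypothetical plots: region label and number of tree species per plot
plots = [
    {"region": "boreal", "richness": 2},
    {"region": "boreal", "richness": 5},
    {"region": "tropical", "richness": 135},
    {"region": "tropical", "richness": 270},
]

# Richness relative to the global maximum
global_max = max(p["richness"] for p in plots)

# Richness relative to the maximum observed in the same region
regional_max = {}
for p in plots:
    region = p["region"]
    regional_max[region] = max(regional_max.get(region, 0), p["richness"])

for p in plots:
    p["rel_global"] = p["richness"] / global_max
    p["rel_regional"] = p["richness"] / regional_max[p["region"]]
    print("%-8s richness=%3d  global=%5.1f%%  regional=%5.1f%%" % (
        p["region"], p["richness"],
        100 * p["rel_global"], 100 * p["rel_regional"]))
```

With the global definition, all boreal plots sit near zero and the axis mostly sorts plots by region; with the regional definition, the richest plot of each region sits at 100%.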

Using this definition (and stratifying by biome, and correcting for spatial distances on a sphere, and using subsampling correction for error bars) we find: **nothing**. (A tiny effect, to the eye indistinguishable from a horizontal line.)

Of course, when looking at each biome separately, we find more or less positive effects, but never as strong as in the original global analysis.

Interested? Read more in our preprint on bioRxiv here!

What to take home? Well, perhaps that observational data are tricky for estimating richness effects. It’s so easy to miss other effects and then wrongly attribute changes in productivity to species richness. (And yes, I include Duffy et al.’s 2017 meta-analysis in this criticism; it’s part of my paranoid scepticism.)

Artificial neural networks, especially deep neural networks (DNNs) and (deep) convolutional neural networks ((D)CNNs), have become increasingly popular in recent years, dominating most machine learning competitions since the early 2010s (for a review of DNNs and (D)CNNs see LeCun, Bengio, & Hinton, 2015). In ecology, there are a large number of potential applications for these methods, for example image recognition, analysis of acoustic signals, or any other type of classification task for which large datasets are available.

Fig. 1 shows the principle of a DNN – we have a number of input features (predictor variables) that are connected to one or several outputs through several hidden layers of “neurons”. The different layers are connected, so that a large value in a previous layer will create corresponding values in the next, depending on the strength of the connection. The latter is learned / trained by adjusting connections / weights to produce a good fit on the training data.

So, how does one build this kind of model in R? A particularly convenient way is the Keras implementation for R, available since September 2017. Keras is essentially a high-level wrapper that makes the use of other machine learning frameworks more convenient. TensorFlow, Theano, or CNTK can be used as backend. As a result, we can create an ANN with n hidden layers in a few lines of code.

As an example, here is a deep neural network, fitted on the iris data set (the data consist of three iris species classes, each with 50 samples described by four features). We scale the input variables to the range (0,1) and “one hot” (= dummy features) encode the response variable. In the output layer, we define three nodes, one for each class. We use the softmax activation function to normalize the outputs, so that each node’s output lies in the range 0,1 and the outputs sum to 1. For an evaluation of the model quality, Keras will split the data into a training and a validation set. The code in Keras is as follows:
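For readers who want to see the preprocessing and output steps from the paragraph above spelled out independently of any framework, here is a minimal plain-Python sketch. It is not Keras code and not the network itself: the function names are hypothetical, and the four measurements and the raw output scores are made-up values used only to show min-max scaling, one-hot encoding and the softmax normalization.

```python
import math

def min_max_scale(column):
    """Rescale a list of numbers so that the minimum is 0 and the maximum is 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def one_hot(labels):
    """Encode class labels as 0/1 dummy vectors (classes in sorted order)."""
    classes = sorted(set(labels))
    return [[1 if label == c else 0 for c in classes] for label in labels]

def softmax(scores):
    """Turn raw output-node scores into values in (0, 1) that sum to 1."""
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up example values, not the actual iris data
sepal_length = [5.1, 7.0, 6.3, 4.9]
species = ["setosa", "versicolor", "virginica", "setosa"]

scaled = min_max_scale(sepal_length)
encoded = one_hot(species)
probs = softmax([2.0, 1.0, 0.1])             # hypothetical scores of the three output nodes

print(scaled)
print(encoded)
print(probs)
```

In a real network, the three scores passed to the softmax would come from the last hidden layer, and the class with the largest normalized value would be the predicted species.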

A common concern with this type of network is overfitting (the error on test data deviates considerably from the training error). We want our model to generalize well (low test error). There are several ways to regularize a network, such as introducing weight penalties (e.g. L1, L2), early stopping, or weight decay.

The dropout method is one simple and efficient way to regularize our model. Dropout means that nodes and their connections will be randomly dropped with probability p during training. This way an ensemble of thinned sub-networks will be trained and averaged for predictions (see Srivastava et al., 2014 for a detailed explanation).
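To illustrate the mechanism, here is a minimal sketch of dropout applied to the activations of one layer. It uses the “inverted” variant that is common in practice, in which the surviving activations are scaled up during training so that nothing needs to change at prediction time; this is an assumption for illustration, and the function name and values are hypothetical.

```python
import random

def dropout(activations, p, rng, training=True):
    """Randomly zero each activation with probability p during training.

    Surviving activations are divided by (1 - p) so that their expected
    value stays the same; at prediction time the input is returned unchanged.
    """
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)                 # fixed seed, so the run is repeatable
layer_output = [1.0] * 1000            # made-up activations of one hidden layer

train_out = dropout(layer_output, p=0.5, rng=rng)
test_out = dropout(layer_output, p=0.5, rng=rng, training=False)

print("dropped during training:", train_out.count(0.0))
print("mean activation (train): %.2f" % (sum(train_out) / len(train_out)))
print("mean activation (test) : %.2f" % (sum(test_out) / len(test_out)))
```

Each training pass draws a different random mask, which is what produces the ensemble of thinned sub-networks mentioned above.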

There is no overall rule for how to set the network architecture (depth and width of layers). In general, the optimization gets harder with the depth of the network. Network parameters can be tuned, but be aware of overfitting (i.e. implement an outer cross-validation).

So, what have we gained? In this case, we have applied the methods to a very simple example only, so benefits are limited. In general, however, DNNs are particularly useful where we have large datasets, and complex dependencies that cannot be fit with simpler, traditional statistical models.

The disadvantage is that we end up with a “black box model” that can predict, but is hard to interpret for inference. This topic has often been named as one of the main problems of machine learning, and there is much research on new frameworks to address this issue (e.g. DALEX, lime, see also Staniak & Biecek, 2018).

By Betteridge’s law, the answer to this question is of course no. Or better: we don’t know. But let’s back up a bit:

Almost a year ago, LaManna and coauthors published a paper in *Science* (1), claiming that conspecific negative density dependence (CNDD) in forests, defined as the effect of local conspecific adult density on the recruit-to-adult ratio in 10x10m and 20x20m quadrats, increases toward the tropics and for rare species.

The strength and clarity of the identified effects were astonishing (at least to us), as were the implied consequences: both in the original *Science* paper and in their press releases (i, ii), the authors interpret their results as suggesting that CNDD controls species abundance and diversity distributions, thus explaining causally why some species are rare and some are common, and why there is a latitudinal diversity gradient. They repeat these statements on YouTube:

In a Technical Comment, published today in *Science* (2), we suggest an alternative, albeit somewhat less glamorous explanation for the results: the statistical CNDD estimators used in LaManna et al. were severely biased. Moreover, the strength of the bias depended on species abundance, and on several other process and community characteristics that potentially correlate with latitude (Fig. 1, more details in our comment, see also our code on GitHub here). Because of this dependence, all the patterns reported in the original publication can emerge even when no CNDD is present whatsoever. We conclude that the methods used in LaManna et al. cannot even reliably detect the mere presence of CNDD, let alone any of the reported differences in CNDD with latitude or species abundance.

*Science* published a second technical comment by Ryan Chisholm and Tak Fung along with our comment, which reports similar results (Ryan also wrote a blog post about their study here). Moreover, we heard informally that Matteo Detto and colleagues had submitted another comment that was, however, not accepted for publication. We invited both to give a short summary of their conclusions regarding the study:

By Ryan Chisholm: In Chisholm and Fung (3), we show in more detail why the bias arises. LaManna et al. used an unusual “statistical trick”, whereby they transformed some data points but not others prior to model fitting, in order to account for the presence of quadrats with saplings but no adults. This “selective transformation” affected more data points in tropical than in temperate plots, which ultimately led to a greater bias in CNDD estimates in tropical plots and an artefactual latitudinal gradient in CNDD. A second statistical problem with the model was the lack of an intercept term, even though an intercept term was clearly suggested by the data and biologically is needed to account for immigration. After identifying the source of the bias, we performed a more appropriate statistical analysis, which does not use a “selective transformation” and includes an intercept in the model, and, on the same data, found no statistically detectable latitudinal trend in CNDD.

By Matteo Detto: I simulated a spatial neutral model where individuals reproduce and displace their offspring according to Gaussian dispersal and saplings become adults without interacting with neighbors. Both the within site pattern (the rare species bias) and the between sites pattern (the latitudinal gradient) produced by the neutral model were similar to the original patterns presented in LaManna et al., suggesting again that the patterns reported in LaManna et al. may be solely a result of a biased statistical estimator (Fig. 2).

We did not see the response by LaManna et al. [to us, to C&F] before yesterday. If we had seen it before, we would have been happy to point out a few errors and misrepresentations of our arguments, in particular:

- That the statistical method for estimating CNDD used in LaManna et al. is biased is a mathematically irrefutable fact (see above / our analysis). LaManna et al. still seem to have trouble grasping that reality when stating, with respect to our null simulations, “Some of these simulations produce spuriously strong CNDD for rare species, leading them to **suggest** that our methods **might** be biased.” (emphasis our own). We do not know how they define bias, but in our book, a method is biased if it produces wrong estimates in reasonable situations. Anyone who doubts that this is the case is welcome to run our code – ~~unfortunately, the reverse is not true, because the code by LaManna et al. is again not made available by the authors~~ [Edit: 27.5.18 – it seems the code has now been made available here].
- The only question is how severe the bias is in the specific situation of this paper, and whether anything other than bias is responsible for the results. We agree that this question is more difficult to answer, but the arguments brought forward by LaManna et al. to defend the existence of a real signal are not convincing. For example, they state “If this [the bias] were correct, then our estimates of CNDD would be biased toward stronger effects for rare species at any latitude”, completely disregarding a whole paragraph in our comment and even a sentence in our abstract where we explain that a number of processes and factors (including the number of rare species) affect the bias, and that any of these processes might (and in the case of rare species certainly does) change with latitude, which explains why the bias may change with latitude.
- In everything that follows, LaManna et al. conveniently disregard any of the other processes that we have shown to create bias, concentrating entirely on dispersal. Doing so, they first misrepresent how we simulated dispersal, stating “That is why analyses that assume global dispersal, as in Hülsmann and Hartig, underestimate or fail to detect CNDD when it is actually present”, before graciously admitting that we also considered non-global dispersal. This argument is doubly wrong: first, because we did not assume global dispersal, except for a single simulation where we varied the dispersal parameter from zero to global, and second, because what they state is exactly the opposite of what we found (under global dispersal, we ALWAYS find CNDD, regardless of whether it is present or not, so there is no way we could “fail to detect CNDD”).
- Going on about dispersal, LaManna et al. suggest that a different dispersal kernel would be more appropriate. We agree that their new kernel corresponds better to measured ecological dispersal kernels, but a) the dispersal kernel we used is (in terms of shape) the dispersal kernel they used in the simulations of their original Science paper, so it is surprising that they are so critical of this choice, and b) given our simulations (see also results by Matteo Detto above), we doubt that the change of the kernel significantly changes our conclusions. However, we will have to look at this in more detail. ~~Unfortunately, data and code for reproducing their results is again not made available by the authors, and the description of the model in the text is certainly not sufficient to reproduce their results~~ [Edit: 27.5.18 – it seems the code has now been made available here].

In conclusion, reading all comments and the responses by LaManna et al., we see no reason to revise our statements that

- The statistical methods used in this paper are severely biased, and it is certainly suspicious that the bias creates patterns in null models that look very similar to the reported results
- We wouldn’t know how to properly correct this bias, but we found none of the arguments or simulations of the authors convincing to rule out the hypothesis that all of the presented patterns are caused by processes and factors other than CNDD, in combination with the context-dependent bias.

As a last point: even if the claimed correlation could be more convincingly demonstrated, we think one should be careful about claims of causality between CNDD and large-scale diversity patterns. For example, temperature could be a cause of both higher diversity (via productivity) and a stronger importance of pathogen control (CNDD) in the tropics. In such a scenario, CNDD and diversity might appear to be causally linked, but the correlation is in fact only caused by another process that affects both CNDD and diversity. Therefore, while we think that local CNDD (if it exists) likely has strong effects on local community structure and abundance, in particular spatial patterns, we would be hesitant to postulate that this scales up, i.e. that local CNDD is a major factor for relative abundance at scales > 50m.

**Side note on data / code availability**

*Science* states that the journal aims at increasing the “transparency regarding the evidence on which conclusions are based”, including open data and code, but neither the code nor the data for the study were deposited at *Science* or another independent data repository. After several emails with the authors, we were able to obtain parts of the code, but not the data. The authors referred us to existing data sharing agreements with (mostly) their coauthors, which did not allow them to pass on the data and would have required us to request each single dataset from the responsible PI. In the end, we only used the BCI dataset, which was already available to us. We think journals should make stronger efforts to enforce that code and data are deposited in appropriate, permanent repositories. Even if data is not fully open, there should be a mechanism to make data available for reproducibility checks upon request, for example through appropriate data use agreements that must be confirmed prior to access.

**References**

1. A. LaManna et al., *Science* 356, 1389–1392 (2017)
2. L. Hülsmann & F. Hartig, *Science* eaar2435 (2018)
3. A. Chisholm & T. Fung, *Science* eaar4685 (2018)


This (co-)guest post by Carsten F. Dormann & Florian Hartig summarizes a comprehensive review on model averaging for predictive inference, just published in Ecological Monographs.

Dormann, C.F., Calabrese, J.M., Guillera-Arroita, G., Matechou, E., Bahn, V., Bartoń, K., *et al.* (in press). Model averaging in ecology: a review of Bayesian, information-theoretic and tactical approaches for predictive inference. *Ecol Monogr*, doi: 10.1002/ecm.1309

When times are dire, and data are scarce, quantitative ecologists (or quantitative scientists in general) often reach into their quiver for an arrow called **model averaging.**

Model averaging refers to the practice of using several models at once for making predictions (the focus of our review), or for inferring parameters (the focus of other papers, and some recent controversy, see, e.g. Banner & Higgs, 2017). There are literally thousands of publications across the disciplines that practice “classical” model averaging, i.e. averaging a few or many models that one could also use “stand-alone”. Additionally, model averaging, as a principle, underlies many of the most commonly used machine-learning methods (e.g. as bagging of trees in random forest or of neural network predictions). We only devoted a few sentences in the appendix of the paper to this, but we think that the link between classical model averaging and machine learning is not sufficiently appreciated and could be further explored.

In ecology, averaging of statistical models is heavily dominated by the “information-theoretical” framework popularised by Burnham & Anderson (2002), while alternative methods that are used in other scientific fields are less well-known. When we set out in March 2015, in the form of a workshop, to conduct a comprehensive review of the wealth of model-averaging approaches, we anticipated this diversity, but not the road full of potholes that we encountered. Studies and information about the topic are fragmented across disciplines, many of which have developed their own ideas and terminology to approach the model-averaging problem. Moreover, the field is largely characterized by a hands-on approach, where alternative ways to average and quantify uncertainties are proposed in abundance, but with very little “cleaning up” of what works and what doesn’t. As a consequence, what started as a small workshop developed into a multi-author, multi-year activity that culminated in a multi-faceted publication, in which the actual technical description of the various available model-averaging algorithms is only one part.

Apart from mapping the method jungle, our review explains, at least in ecology probably for the first time,

- **why and when model averaging works, and what this depends on** (see our explanation of how bias, (co)variance and uncertainty of weight estimation influence the benefits of MA);
- **how to quantify the uncertainty of model-averaged predictions**, and why there are substantial problems in achieving good uncertainty estimates.

The goal of this post is to whet your appetite, not to reproduce the entire paper. Thus, in what follows, we will only have a superficial look at the ingredients of each of these points.

The first part of our paper shows how error of model-averaged predictions can be decomposed into bias and error (co)variance of the contributing models, and uncertainty of weight estimation. Some key insights are:

- If our different models err systematically, but equally on the high and the low side, then their average has less bias.
- If our models vary stochastically, but all in the same way, then there is little point in averaging them. MA becomes more useful the lower the covariance between estimates.
- If all our models are more or less equally great (or poor), we can save ourselves the trouble of estimating weights.

Here are some titbits of explanation:

First off, the prediction uncertainty, quantified e.g. as mean squared error MSE, is the sum of squared bias and variance. Hence, we can decompose the effect of model averaging into its effect on bias, and on variance.

The first point about systematic error is usually not so relevant for statistical models. Classical/typical/good statistical models are unbiased, i.e. their mean prediction does not deviate from the truth. For process-based models, this need not be the case. If a process is specified wrongly, the model’s predictions may be consistently too high or too low. When predictions from different process models with biases in either direction are averaged, these biases should cancel to some extent and hence reduce bias in the averaged prediction, which explains why model averaging is popular in process-based modelling communities such as climate modelling.

The second point about variance is more relevant for statistical models. Variance refers to the fact that an ideal statistical model gets it right on average (no bias), but will still make an error in each single application (variance). For an unbiased model, predictions will have a smaller error if their variance is lower. We show that, as a consequence of error propagation, the variance of the averaged prediction depends on the variance of each contributing model, as well as the **co**-variances among these predictions. Thus, if all models made identical predictions, the covariance would cancel any benefit of averaging variances. If, however, model predictions are perfectly uncorrelated, we get great benefits for their prediction’s variance.
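For readers who like to see this written down, here is a sketch of the standard decomposition behind these statements. The notation is ours for this post and not necessarily that of the paper: $\hat{y}_m$ is the prediction of model $m$, and the weights $w_m$ sum to one and are treated as fixed (i.e. the uncertainty of weight estimation is ignored here).

$$
\mathrm{MSE}(\hat{y}) = \mathrm{bias}(\hat{y})^2 + \mathrm{Var}(\hat{y}), \qquad
\hat{y}_{\mathrm{MA}} = \sum_{m=1}^{M} w_m \hat{y}_m
$$

$$
\mathrm{Var}(\hat{y}_{\mathrm{MA}}) = \sum_{m=1}^{M} w_m^2 \, \mathrm{Var}(\hat{y}_m) + 2 \sum_{m < m'} w_m w_{m'} \, \mathrm{Cov}(\hat{y}_m, \hat{y}_{m'})
$$

In the simplified case of equal weights $w_m = 1/M$, equal variances $\sigma^2$ and a common correlation $\rho$ between all pairs of predictions, this reduces to

$$
\mathrm{Var}(\hat{y}_{\mathrm{MA}}) = \frac{\sigma^2}{M} \bigl( 1 + (M - 1) \rho \bigr),
$$

which equals $\sigma^2$ for $\rho = 1$ (no benefit of averaging) and $\sigma^2 / M$ for $\rho = 0$ (full benefit).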

Hang on!? So if my models make very different predictions (which might be worrying for some), only then I get the full benefits of model averaging? Correct!

And it gets more complicated.

There is another factor influencing the variance, which is the weighting of the models. If we threw all models we can get our hands on willy-nilly into an averaging procedure, then surely, we need to sort the wheat from the chaff first. It seems illogical to allow a crappy model to ruin our model average, so we need to downweight it. Or, as the advice in many papers reads: “Only average plausible models.”

Here, it gets really confusing in the literature, because that’s exactly what many highly successful machine-learning approaches do **not** do. For example, in bagging, a commonly used machine learning principle, **all** models are averaged, and they are not even weighted!

The underlying issue is that, when estimating model weights, we may accrue substantial uncertainty, and this uncertainty also propagates into our model-averaged prediction (Claeskens et al. 2016)! Indeed, it may often be wiser not to compute model weights if we have already pre-selected our models, as is the common procedure in economics and with the IPCC earth-system models.

After having established that model averaging can (in the right circumstances) improve predictions, let us turn to the second presumed benefit of model averaging, a better representation of uncertainty.

A commonly named reason to use model averaging is that we cannot decide which of our candidate models is the correct one, and therefore want to include them all to better represent our structural uncertainty. So then the obvious question is: how do we compute an uncertainty estimate of a model average? As ingredients we (possibly) have (a) a prediction from each model, (b) a standard error for each model’s prediction, e.g. from bootstrapping, (c) the model weights, and (d) the unknown uncertainty in the model weights. How to brew them into one 95% confidence interval of the model-averaged prediction?

Again, we shall not disclose the details as given in the paper, but this issue caused some serious head-scratching among the authors (each by herself, of course).

As a teaser: there are a few proposals of how to construct frequentist confidence intervals, but they are by-and-large problematic. Some assume perfect correlation of predictions and “non-standard mathematics”, others assume perfect independence and work surprisingly well in our little test-run. (Our personal all-time favourite, the full model, did of course best, but that is not a very helpful finding for any process modeller.)

However, it should be noted that things are not so bad if one is only interested in a predictive error (which can be obtained by cross-validation), or if one works Bayesian, as posterior predictive intervals are more natural to compute.

Finally, we come to the topic that you all must have waited for: what’s the best method to compute the weights? We gave it away already: it’s hard to say, because there are **many** proposals out there, far more than informative method comparisons.

We divided the method-zoo into three sections: one for Bayesians, one for “IC folks”, and one for practically-oriented folks (aka machine learners & co).

The pure **Bayesian** side is theoretically simple, but difficult implementation-wise (we’re talking here about the problem of estimating marginal likelihoods of the models, e.g. by reversible-jump MCMC or some other approximations).

The **information theoretical** approaches are theoretically somewhat more dubious (because they seem to strongly head into the Bayesian direction, with model weights being something akin to model probabilities, but then verbally shun Bayesian viewpoints), but well established computationally.

The smorgasbord (this word was chosen to reflect the European dominance in the author collective) of approaches not fitting either category, which we labelled **tactical**, comprised the sound and obscure. In short, we summarize here all the approaches that directly target a reduction of predictive error, be it by machine-learning principles or verbal argument. Key examples here are *stacking* and *jackknife* model averaging.
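To give a flavour of what such a tactical approach looks like, here is a minimal sketch of stacking-type weights for two linear models, chosen to minimise the leave-one-out prediction error. The simulated data, the two candidate models and the restriction to a single weight between 0 and 1 are assumptions made for this illustration; this is not the code used in the paper.

```r
set.seed(1)
n <- 60
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 0.8 * x1 + 0.3 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

# Two deliberately incomplete candidate models
forms <- list(y ~ x1, y ~ x2)

# Leave-one-out predictions: one column per model, one row per observation
loo <- sapply(forms, function(f) {
  sapply(seq_len(n), function(i) {
    fit <- lm(f, data = dat[-i, ])
    predict(fit, newdata = dat[i, , drop = FALSE])
  })
})

# Cross-validated squared error of the weighted average, w = weight of model 1
cv_error <- function(w) {
  mean((dat$y - (w * loo[, 1] + (1 - w) * loo[, 2]))^2)
}

# Choose the weight that minimises the cross-validated error
w_hat <- optimize(cv_error, interval = c(0, 1))$minimum
weights <- c(w_hat, 1 - w_hat)
weights
```

The point of the design is that the weights are tuned directly on out-of-sample error, rather than derived from an information criterion or a marginal likelihood.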

Detailed explanations of each approach are given in the paper, and we also ran most methods through two case studies. We found little in our results to justify the dominance of AIC-based model averaging. And model-averaging did not necessarily outperform single models.

Model averaging has no super-powers. Claims of “combining the best from all models” are plain nonsense. Like most other statistical methods, at close inspection, we see that model averaging has benefits and costs, and an analyst must weigh them carefully against each other to decide which approach is most suitable for their problem.

Benefits include a possible reduction of predictive error. Costs include the fact that this does not always work. And that confidence intervals (and also p-values) are difficult to provide.

To reduce prediction error, we recommend cross validation-based approaches, which are specifically designed to achieve this goal. Embracing model structural uncertainty is certainly a laudable ambition, but the precise mathematics are complicated, and robust methods that work out of the box are not yet worked out.

Banner, K. M. and M. D. Higgs (2017) Considerations for assessing model averaging of regression coefficients. Ecological Applications, 28:78–93.

Burnham KP, Anderson DR (2002) Model Selection and Multi-Model Inference: a Practical Information-Theoretical Approach. 2nd ed. Berlin: Springer.

Claeskens G, Magnus JR, Vasnev AL, Wang W. The forecast combination puzzle: A simple theoretical explanation. International Journal of Forecasting. 2016;32:754–62.

Dormann, C.F., Calabrese, J.M., Guillera-Arroita, G., Matechou, E., Bahn, V., Bartoń, K., *et al.* (in press). Model averaging in ecology: a review of Bayesian, information-theoretic and tactical approaches for predictive inference. *Ecol Monogr*, doi: 10.1002/ecm.1309

Technical statistical mistakes are overrated; ecologists (especially students) worry too much about them. Individually and collectively, technical statistical mistakes hardly ever appreciably slow the progress of entire subfields or sub-subfields. And fixing them rarely meaningfully accelerates progress.

continuing with

Don’t agree? Try this exercise: name the most important purely technical statistical mistake in ecological history. And make the case that it seriously held back scientific progress.

I would argue that nothing could be further from the truth. It’s actually no challenge at all to point out massive statistical problems that slow down progress in ecology. And not only because of this, but also simply because using inappropriate methods “is the wrong thing to do” for a scientist, I very much hope that students worry about this topic. Let me give a few examples.

Statistical errors need not be massive and obvious to have an impact on the wider field.

IF A LOT OF SMALL PEOPLE IN A LOT OF SMALL PLACES DO A LOT OF SMALL THINGS, THEY CAN CHANGE THE FACE OF THE WORLD (possibly an African proverb, but surely a graffiti on the Berlin wall)

In the last years, there has been a widespread debate throughout the sciences about the reliability / replicability of scientific results (I blogged about this a few years back here and here, but there have been many new developments since – a recent collection of papers in PNAS provides a great, although somewhat broader overview).

The statistical issue I’m referring to is the impact of analysis decisions like

- Changing the hypotheses (predictor or response variables) during the analysis, e.g. trying out various combinations of predictors and response variables to see if the results are “improved” or what is “interesting”. **This includes looking at the data before the analysis and deciding based on that what tests to make!**
- Making data collection dependent on results, e.g. collecting a bit more data if there seems to be an effect, or removing data if it seems “weird”, or here
- Trying out different statistical tests and using those that produce “better” = more significant results
- etc. etc.

I think few people who are involved in teaching ecological statistics will dispute that these strategies, known as p-hacking, data-dredging, fishing and harking (hypothesizing after results are known), are widespread in ecology, and a large body of research shows that they tend to have a substantial impact on the rate at which false positives are produced (see, e.g., Simmons et al., or the mind-boggling Brian Wansink story).
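To illustrate one of these strategies, here is a small simulation sketch of “collecting a bit more data if there seems to be an effect” (optional stopping). The sample sizes, the number of looks at the data and the use of a one-sample t-test are arbitrary choices for this example. There is no true effect in the simulated data, so every significant result is a false positive.

```r
set.seed(42)
n_sim <- 2000
looks <- c(20, 30, 40, 50)   # sample sizes at which we peek at the p-value

res <- replicate(n_sim, {
  x <- rnorm(max(looks))     # no true effect: the mean is 0
  p <- sapply(looks, function(n) t.test(x[1:n])$p.value)
  c(fixed = p[1] < 0.05,     # honest analysis: test once at n = 20
    optional = any(p < 0.05)) # stop and report as soon as p < 0.05
})

# False positive rates of the two strategies
rates <- rowMeans(res)
rates
```

By construction, the optional-stopping rate cannot be lower than the fixed-sample rate, and in runs like this it should typically end up well above the nominal 5%.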

Could this be solved? Of course it could – the solution is well-known. For a confirmatory analysis, you need to fix your hypothesis before the data collection and stick with it. Best with a pre-registered analysis plan. I once suggested this to a colleague from an empirical ecology group, and was told “Are you crazy? If we did this, our students would never finish their PhD – the original hypothesis hardly ever checks out” … any questions about whether there are issues in ecology?

Side note – I’m all for giving exploratory analyses more weight in science, see e.g. here, but exploratory analysis = being honest about the goal. Fishing != exploratory analysis!

The second issue I’m seeing is that there are widely accepted analysis strategies in ecology that are statistically unsound. The best example I have is the analysis chain of

- Perform AIC selection
- Present regression table of the AIC selected model

What few people realize is that, while AIC selection alone is useful, and regression tables alone are useful as well, the **combination of an AIC selection with a subsequent regression table is problematic**. Specifically, in combination, the p-values in the regression table will generally be incorrect, because they do not account for the earlier AIC selection (how should they, your R command doesn’t know you did a selection). If you don’t believe me that this is a problem, try this
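The original snippet is not reproduced here, so the following is a minimal sketch of the kind of simulation meant, under assumptions chosen for this illustration: 30 candidate predictors, 100 observations, a response that is pure noise, and backward selection via `step()`, which uses AIC by default.

```r
set.seed(123)
n <- 100   # observations
p <- 30    # candidate predictors, none of which affects y

dat <- as.data.frame(matrix(rnorm(n * p), nrow = n))
dat$y <- rnorm(n)                     # response is pure noise

full <- lm(y ~ ., data = dat)
selected <- step(full, trace = 0)     # stepwise selection by AIC

# p-values of the predictors (intercept dropped), before and after selection
p_full <- summary(full)$coefficients[-1, 4]
p_sel  <- summary(selected)$coefficients[-1, 4]

mean(p_full < 0.05)   # share of "significant" predictors in the full model
mean(p_sel < 0.05)    # share of "significant" predictors after AIC selection
```

In the full model, the share of significant predictors should be close to the nominal 5%. After selection, the share among the retained predictors is typically far higher, because the regression table of the selected model knows nothing about the selection step.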

The full model has correct type I error rates of approximately 5%. Here’s the result after model selection – let me remind you that none of these variables truly has an effect on the response. I am pretty certain that I could get such an analysis into an ecology journal, writing a nice discussion about the ecological sense of each of these “effects”, and why our results differ from some previous studies etc. bla bla. This is why I don’t (and neither should you) do model selection for hypothesis-driven analyses!

Finally, as a third category, let’s come to statistical methods that are fundamentally flawed in the first place. I could name a whole list of issues off the top of my head, including

- Fitting power laws by log-log linear regression on size classes, which produces biased estimates and significantly distorted efforts to test metabolic scaling theories (see an old post here).
- Regressions on beta diversity / community indices, which are notoriously unstable / dependent on other things; as well as regressions on network indices, which have the same problems. Lots of spurious results produced in these fields over the years. Incidentally, null models are not a panacea, although they help.
- And of course, there is a long list of papers that made good old plain mistakes in the analysis, whose correction completely changes the conclusions. Lisa Hülsmann and I have a technical comment forthcoming that will be discussed in a future post, but here is an old example.
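As an illustration of the first item in this list, here is a small sketch that compares a log-log regression on binned counts with the maximum likelihood estimate for simulated power-law data. The true exponent, the sample size, the lower cut-off of 1 and the linear binning are assumptions chosen for this example.

```r
set.seed(7)
alpha <- 2.5    # true exponent of the density p(x) ~ x^(-alpha), x >= 1
n <- 10000

# Inverse-transform sampling from the continuous power law
x <- runif(n)^(-1 / (alpha - 1))

# (a) Log-log linear regression on size classes (linear bins, empty bins dropped)
h <- hist(x, breaks = 50, plot = FALSE)
keep <- h$counts > 0
fit <- lm(log(h$counts[keep]) ~ log(h$mids[keep]))
alpha_reg <- unname(-coef(fit)[2])

# (b) Maximum likelihood estimate for a continuous power law with x_min = 1
alpha_mle <- 1 + n / sum(log(x))

c(true = alpha, regression = alpha_reg, mle = alpha_mle)
```

The maximum likelihood estimate should land close to the true exponent, whereas the regression estimate depends on the binning and can be far off.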

You might point out that we still have to show that all this has an impact on ecological progress. It’s a tricky task, because the question itself leaves a lot of wiggle room – what is the definition of progress in the first place, and how would you know that progress has been slowed down, as long as money comes in and papers get published?

I know it’s not 100% fair, but let me turn this question around: if it didn’t matter for the wider field if what we report as scientific facts is correct or not, why go through all the painstaking work to collect data in the first place? By the same logic, I could write:

[irony on] Young people worry far too much about data collection, instead of just inventing data. I challenge you to name the most important data fabrication in ecological history. And make the case that it seriously held back scientific progress [irony off].

Moreover, I find it very hard to believe that there is no adverse effect of producing a lot of wrong results in any scientific field. In the best case, by creating noisy results, we’re less effective than we could be, burning money and slowing down a movement in the right direction. In the worst case, we could go in the wrong direction altogether, as might have happened recently in psychology.

But even if there were no effect on the progress of science (which I think there is), I’d argue in good old Greek tradition that **using inappropriate tools and producing wrong results is simply not the right thing to do as a scientist**. It’s undermining the ethics, aesthetics and professional practices of science, and regardless of whether it directly affects progress, I’m quite happy for any student that worries about using the appropriate tools!

ps: of course, one can worry about things that are not important. using a t-test on non-normal data is often not a big issue. But to know this, you have to worry first, and then test it out!

pps: I’m not saying that stats is the only thing one has to worry about. Good theory / hypotheses are another one of course, as is clear thinking. But I think stats + experimental design is quite central to getting science right.

[edit 6.5.18] After writing this post, I became aware of the study “Wang et al. (2018) Irreproducible text‐book “knowledge”: The effects of color bands on zebra finch fitness”, which seems to show at least one example where a field maintains a wrong conclusion due to low power / research degrees of freedom / selective reporting, comparable to what’s going on in psychology.



A reblog from AK Computational Ecology, summarizing a panel discussion I participated in on biotic interactions and jSDMs at the Ecology Across Borders conference in Ghent, Dec 2017.

This guest post by Carsten F. Dormann, with inputs from Casper Kraan and the panel (see below) summarises the results from the short workshop “Biotic interactions and joint species distribution models” at the Ecology Across Borders BES/GfÖ/NEVECOL/EEF-meeting 2017 in Ghent, Belgium. The purpose of this event was to exchange thoughts and questions about joint Species Distribution Models (jSDMs) and their ecological interpretation, in particular as indicators of biotic interactions.

The workshop was organised and moderated by Carsten Dormann and Casper Kraan (who regrettably was ill and could not attend). A panel of five people using/developing jSDMs answered questions (or commented on points of view) expressed by the workshop participants (“audience”): Heidi Mod (Uni Lausanne, CH), Jörn Pagel (Uni Hohenheim, D), Melinda de Jonge (Radboud Uni, NL), Florian Hartig (Uni Regensburg, D) and Nick Golding (Uni Melbourne, AUS).

In the workshop, we implicitly…


In the R environment and beyond, a large number of packages exist that estimate posterior distributions via MCMC sampling, either for specific statistical models (e.g. MCMCglmm, INLA), or via general model specification languages (such as JAGS, STAN).

Most of these packages are not designed for sampling from an arbitrary target density provided as an R function. For good reasons, as statistical modelers will seldom have the need for such a tool – it’s hard to come up with a likelihood that cannot be specified in one of the existing packages, and if so, the problem is usually so complicated that it requires specially adapted MCMC strategies.

It’s another story, however, for people that work with process-based models (simulation models, differential equation models, …). These models are usually implemented in standalone code that cannot easily be integrated into JAGS or STAN (yes, I know STAN has an ODE solver, but generally speaking … ). What we can do, however, is to call any process model from R, calculate model predictions and then calculate a likelihood based on some distributional assumptions, e.g. as in the following pseudocode.

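    # Pseudocode: processmodel() and observed stand for your own model and data
    likelihood = function(param){
      predicted = processmodel(param)
      # param[x] is the parameter that acts as the residual standard deviation
      ll = sum(dnorm(observed, mean = predicted, sd = param[x], log = TRUE))
      return(ll)
    }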

This leaves us with a function likelihood(par) or posterior(par) that we want to sample from.
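To make the pseudocode concrete, here is a minimal self-contained sketch. The process model (a straight line), the simulated observations and the choice of the third parameter as the residual standard deviation are all assumptions made for this illustration and are not part of BayesianTools. Note that the function returns the log-likelihood, even though it keeps the name `likelihood` from the pseudocode.

```r
set.seed(1)

# Hypothetical stand-in for a process model: a straight line in x
x <- seq(0, 10, length.out = 50)
processmodel <- function(param) param[1] + param[2] * x

# Simulated "observations" with known parameters (intercept, slope, sd)
true_param <- c(2, 0.5, 1)
observed <- processmodel(true_param) + rnorm(length(x), sd = true_param[3])

# Log-likelihood under the assumption of independent normal errors
likelihood <- function(param) {
  predicted <- processmodel(param)
  sum(dnorm(observed, mean = predicted, sd = param[3], log = TRUE))
}

likelihood(true_param)   # log-likelihood at the true parameters
likelihood(c(0, 0, 1))   # much lower for a poor parameter guess
```

A function of this form, taking a parameter vector and returning a single log-likelihood value, is what the samplers need as input.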

BayesianTools is a toolbox for this problem. It provides general-purpose MCMC and SMC samplers, as well as plot and diagnostic functions for Bayesian statistics, with a particular focus on calibrating complex system models. The samplers implemented in the package are optimized for problematic target functions (strong correlations, multimodal targets) that are more commonly found in process-based than in statistical models. Available samplers include various Metropolis MCMC variants (including adaptive and/or delayed rejection MH), the T-walk, two differential evolution MCMCs, two DREAM MCMCs, and a sequential Monte Carlo (SMC) particle filter.

Here is an example of how to run an MCMC on a likelihood function in BayesianTools:

    library(BayesianTools)
    setup = createBayesianSetup(likelihood, lower = c(-100,-100,0), upper = c(100,100,100))
    out = runMCMC(setup, sampler = "DEzs", settings = NULL)

Obviously, you should read the help for details. The package supports most common summaries and diagnostics, including convergence checks, plots, and model selection indices, such as DIC, WAIC, or marginal likelihood (to calculate the Bayes factor).

    gelmanDiagnostics(out)
    summary(out)
    plot(out, start = 1000)
    correlationPlot(out, start = 1000)
    marginalPlot(out, start = 1000)
    MAP(out)
    DIC(out)
    WAIC(out) # requires special definition of the likelihood, see help
    marginalLikelihood(out)

Below are some example outputs obtained from calibrating the VSEM model, a simple ecosystem model that is provided as a test model in the package. The code to produce these plots is available when you type ?VSEM in R.

**References**

Hartig, F, Minunno, F, Paul, S (2017) BayesianTools: General-Purpose MCMC and SMC Samplers and Tools for Bayesian Statistics. R package version 0.1.3 [CRAN] [GitHub]
