One example of this is the **validity concepts** taught in the social sciences and economics (see Wikipedia). In short, these categorize “failure modes” of inference (e.g. construct validity, internal validity, external validity). Ecologists are certainly aware of these problems as well, but in ecology they are not typically taught as a concise list / framework in the standard curriculum, which I have found to be immensely helpful for students.

Another example is causal inference, and specifically the concept of **mediators, confounders and colliders.** This goes back at least to Pearl 2000 (see also Pearl 2009a,b), and with the popularity of SEMs in ecology, I’m sure that people have at least heard about causal inference in general. However, when reading **the really excellent and highly recommended paper Lederer et al., 2019** *“Control of confounding and reporting of results in causal inference studies. Guidance for authors from editors of respiratory, sleep, and critical care journals.”* in our group seminar, I got the distinct feeling that the practical interpretation of these ideas differs quite strongly between medical and ecological fields.

Lederer et al. first nicely establish an **operational concept of causality** that I would broadly agree with also for ecology: assume we look at the effect of a target variable (something that could be manipulated = predictor) on another variable (the outcome = response) in the presence of other (non-target) variables. The goal of a causal analysis is to control for these other variables in such a way that we estimate the same effect size that we would obtain if only the target predictor was manipulated (as in a randomized controlled trial, RCT).

You have probably learned in your intro stats class that, to do so, we have to control for confounders. I am less sure, however, whether everyone is clear about what a confounder is. In particular, **confounding is more specific than having a variable that correlates with predictor and response**. The direction of the causal arrows is crucial for identifying true confounders. For example, Fig. 1 C from the Lederer paper shows a collider, i.e. a variable that is influenced by predictor and response. Although it correlates with predictor and response, correcting for it (or including it) in a multiple regression will create a **collider bias** on the causal link we are interested in (corollary: **including all variables is not always a good thing**). The bottom line of this discussion (and the essence of Pearl 2000, 2009) is that to establish causality for a specific link, we have to close the so-called back-door paths for this link, by

- Controlling for confounders (back-doors, blue paths in the figure)
- Not controlling for colliders, M-Bias, and other similar relationships (red paths)
- It depends on the question whether we should control for mediators (yellow paths)
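To see how severe collider bias can be, here is a small numerical sketch (hypothetical effect sizes, not taken from any of the cited papers): the true effect of the predictor on the response is 1, and the collider is influenced by both. Conditioning on the collider almost completely erases the estimated effect:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# true causal structure: x -> y, plus a collider c influenced by x and y
x = rng.normal(size=n)
y = 1.0 * x + rng.normal(size=n)   # true causal effect of x on y is 1.0
c = x + y + rng.normal(size=n)     # collider

def ols(X, target):
    # least-squares coefficients (no intercept needed, all variables centred)
    return np.linalg.lstsq(X, target, rcond=None)[0]

b_simple = ols(x[:, None], y)[0]                   # y ~ x
b_collider = ols(np.column_stack([x, c]), y)[0]    # y ~ x + c

print(f"y ~ x     : effect of x = {b_simple:.2f}")    # close to the true 1.0
print(f"y ~ x + c : effect of x = {b_collider:.2f}")  # biased towards 0
```

Flipping the arrows (letting c cause both x and y, i.e. a true confounder) reverses the lesson: then the simple regression is the biased one, and including c recovers the causal effect.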

My impression is that these types of arguments are well-established in the medical and economic literature (in the sense that people regularly use them to defend the inclusion / exclusion of variables in a regression), but that they are rarely invoked in the ecological literature.

Moreover, what I really liked about the Lederer paper is their discussion of the **Table 2 fallacy.** The paper recommends that variables included as confounders should NOT be discussed and should not be presented in the regression table at all (this is typically Table 2 in a paper, hence the name), because they are themselves usually not corrected for confounding (and they shouldn’t, or at least don’t have to, be corrected, see Pearl 2000 / discussion above). Sensible advice, but I think contrary to common practice in standard and SEM regression reporting in ecology.

A cynical (but possibly accurate) explanation for why the Table 2 fallacy is the norm in ecology is that we rarely have a clear target variable / hypothesis, and thus we feel all variables that were used have to be discussed. A side effect is that this makes for the most boring results / discussion sections, where the effect of one variable after the other has to be discussed and interpreted. More importantly, however, each variable that is discussed as a causal effect must be controlled for confounding, or else we should make a clear distinction between the variables that are controlled and those that aren’t. As I said, Lederer et al. recommend not mentioning uncontrolled variables at all. I’m not sure if that is practical for ecology (as analyses are often semi-explorative), but I have recently been wondering about the option to separate reasonably controlled from possibly confounded variables by a bar or extra section in the regression table.

My only small quibble with the otherwise excellent Lederer paper relates to their comments about significance. First, I strongly support their call for concentrating on parameters and CIs instead of p-values. However, I find their recommendation to avoid the phrase “not significant” in favor of a vague formulation such as “the estimate is imprecise” a bad one (this is, by the way, similar to some other recent papers, e.g. Dushoff et al., 2019, Amrhein et al., 2019, which would make a nice topic for another post). The idea behind this recommendation is that researchers tend to misinterpret n.s. as “no effect”, but it seems to me the response should be to better educate researchers about what n.s. means, not to muddy the waters by hiding the fact that a test was done.

Lederer, D. J., Bell, S. C., Branson, R. D., Chalmers, J. D., Marshall, R., Maslove, D. M., … & Stewart, P. W. (2019) Control of confounding and reporting of results in causal inference studies. Guidance for authors from editors of respiratory, sleep, and critical care journals. *Annals of the American Thoracic Society*, *16*(1), 22-28.

Pearl, J. (2009) Causal inference in statistics: An overview. *Statistics surveys* 3, 96-146.

Pearl, J. (2000 / 2009) *Causality*. Cambridge University Press, 1st / 2nd ed.

Dushoff, J., Kain, M.P. and Bolker, B.M., 2019. I can see clearly now: reinterpreting statistical significance. *Methods in Ecology and Evolution*.

Amrhein, V., Greenland, S. and McShane, B., 2019. Scientists rise up against statistical significance. *Nature*, 567, 305-307.


The relationship between species richness and ecosystem function is a field of ecology that has always puzzled me. I learned the scientific ropes in a department of vegetation ecologists: vegetation was the result of environmental conditions, and indeed a substantial part of their research was to quantify what a plant species indicates about its environment (think Ellenberg indicator values). While of course species may be absent from a community due to competition, those species that *are* there reflect climate, soil, management.

Thus, when I see a paper showing a **strong** effect of species richness, I feel that there must be something amiss. (This paranoid and blanket scepticism goes far beyond “biodiversity” effects.) Can it really be true that in a give-or-take “natural” system we can boost productivity by 100-200% by having more species? Looking out of my office window, I can make out the Black Forest, and a nice large monoculture of spruce. Will adding *a random local tree species* increase the productivity? And does a mixture of, say, beech and spruce with a higher productivity demonstrate an effect of tree species richness (TSR) on productivity (P)?

Actually, this blog post is an appetizer for our re-analysis of Liang et al. (2016, Science). But bear with me for another brief excursion. Let me first repeat an argument I read in Donald Maier’s scathing critique of “biodiversity research” (Maier 2012: “What’s So Good About Biodiversity?”, Springer): when we plot species richness on the x-axis, we assume that the species we count are equivalent. If they weren’t, their number would not be helpful, and we should quantify something else, e.g. a trait, their abundance, or their composition; but not their *number*. And, when investigating the effect of TSR, the x-axis implies random species composition. If it wasn’t random, then richness would be confounded with something else. (Admittedly, Maier put it better, but also more verbosely.)

Liang et al. (2016, Science: “Positive biodiversity-productivity relationship predominant in global forests”) present such a figure, with an increase in productivity from around 3.5 to well over 10 m³ ha⁻¹ yr⁻¹ as “relative species richness” increases from little to 100% on the x-axis. Such a figure rings my alarm bells. So, together with two BSc students, we re-analysed the data presented in that paper.

There are various points that we consider problematic (be it extremely unrealistic values for P; Euclidean distances between plots on a spherical world; non-stratified sampling of biomes; computation of “bootstrapped” error bars), and we investigated them one by one, but the pivotal point is the x-axis: What does “relative species richness” mean? Quite simply, it is the number of tree species in a plot divided by 270, the highest species richness in the data set considered. (Now that is a tiny bit unfair, but it is essentially what it is. In the rundown of the re-analysis we of course use Liang et al.’s definition.) So, a 10-species plot in Finland receives a value of about 4%, while a plot in Panama gets a value of 100%. Can you spot the problem? Yes: the TSR gradient is in fact a latitudinal gradient. That, in turn, means that the plot does not depict the effect of TSR on P, but of latitude on P!

We were still charmed by the idea of constructing an x-axis that is relative. Instead of “relative to the highest richness in the tropics”, however, we constructed a tree-species richness relative to the highest number of tree species observed *in that region*. So 100% means “as many as you can get around here”, and varies between 5 tree species in Siberia and 500 in Panama.
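As a toy illustration of the two normalizations (plot values are hypothetical, not the actual data), note how a species-poor boreal plot scores near zero on the global scale but can be “saturated” on the regional scale:

```python
# hypothetical plots: (region, tree species richness of the plot)
plots = [("Finland", 10), ("Finland", 4), ("Panama", 270), ("Panama", 90)]

# global normalization, as in Liang et al.: divide by the overall maximum (270)
global_max = max(s for _, s in plots)

# regional normalization: divide by the maximum observed in that region
regional_max = {}
for region, s in plots:
    regional_max[region] = max(regional_max.get(region, 0), s)

rows = [(r, s, s / global_max, s / regional_max[r]) for r, s in plots]
for r, s, g, l in rows:
    print(f"{r:8s} S={s:3d}  global-relative={g:6.1%}  region-relative={l:6.1%}")
```

Under the global definition the x-axis simply tracks the tropical-to-boreal gradient; under the regional one, a 10-species Finnish plot and a 270-species Panamanian plot both sit at 100%.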

Using this definition (and stratifying by biome, and correcting for spatial distances on a sphere, and using subsampling correction for error bars) we find — **nothing**. (A tiny effect, to the eye indistinguishable from a horizontal line.)

Of course, when looking at each biome separately, we find more or less positive effects, but never as strong as in the original global analysis.

Interested? Read more in our preprint on bioRxiv here!

What to take home? Well, perhaps that observational data are tricky for estimating richness effects. It is so easy to miss effects and then wrongly attribute changes in productivity to species richness. (And yes, I include Duffy et al.’s 2017 meta-analysis in this criticism; it’s part of my paranoid scepticism.)

Artificial neural networks, especially deep neural networks (DNNs) and (deep) convolutional neural networks ((D)CNNs), have become increasingly popular in recent years, dominating most machine learning competitions since the early 2010s (for reviews of DNNs and (D)CNNs see LeCun, Bengio, & Hinton, 2015). In ecology, there are a large number of potential applications for these methods, for example image recognition, analysis of acoustic signals, or any other type of classification task for which large datasets are available.

Fig. 1 shows the principle of a DNN – we have a number of input features (predictor variables) that are connected to one or several outputs through several hidden layers of “neurons”. The different layers are connected, so that a large value in a previous layer will create corresponding values in the next, depending on the strength of the connection. The latter is learned / trained by adjusting connections / weights to produce a good fit on the training data.

So, how does one build these kinds of models in R? A particularly convenient way is the Keras implementation for R, available since September 2017. Keras is essentially a high-level wrapper that makes the use of other machine learning frameworks more convenient; TensorFlow, Theano, or CNTK can be used as the backend. As a result, we can create an ANN with n hidden layers in a few lines of code.

As an example, here is a deep neural network fitted to the iris data set (the data consist of three iris species classes, each with 50 samples of four describing features). We scale the input variables to the range (0,1) and “one hot” (= dummy features) encode the response variable. In the output layer, we define three nodes, one for each class. We use the softmax activation function to normalize the output of each node so that the outputs sum to 1. For an evaluation of the model quality, Keras will split the data into a training and a validation set. The code in Keras is as follows:
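A minimal sketch of such a model, shown here in the Python Keras API for concreteness (the R interface mirrors these calls; the two hidden layers of 8 units, the optimizer and the number of epochs are illustrative choices, and scikit-learn is used only to load the data):

```python
import numpy as np
from sklearn.datasets import load_iris
from tensorflow import keras

iris = load_iris()
# scale each predictor to the range (0, 1)
x = (iris.data - iris.data.min(axis=0)) / (iris.data.max(axis=0) - iris.data.min(axis=0))
# "one hot" (dummy feature) encode the three species classes
y = keras.utils.to_categorical(iris.target, num_classes=3)

model = keras.Sequential([
    keras.Input(shape=(4,)),                      # four describing features
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),  # one output node per class
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# validation_split holds out 20% of the data to evaluate model quality
history = model.fit(x, y, epochs=100, batch_size=16,
                    validation_split=0.2, verbose=0)
```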

A common concern with this type of network is overfitting (the error on test data deviates considerably from the training error). We want our model to generalize well (low test error). There are several regularization techniques to achieve this, such as weight penalties (e.g. L1, L2), early stopping, or weight decay.

The dropout method is a simple and efficient way to regularize our model. Dropout means that nodes and their connections are randomly dropped with probability p during training. This way, an ensemble of thinned subnetworks is trained and averaged for predictions (see Srivastava et al., 2014 for a detailed explanation).
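The mechanics are easy to sketch. The following shows “inverted” dropout, the variant used in most modern frameworks (in Keras it is available as a `Dropout` layer), as a plain-numpy toy:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p, training=True):
    """Inverted dropout: drop units with probability p and rescale the
    survivors by 1/(1-p), so the expected activation is unchanged and no
    rescaling is needed at prediction time."""
    if not training or p == 0:
        return activations
    mask = rng.random(activations.shape) >= p
    return activations * mask / (1.0 - p)

a = np.ones((4, 1000))           # a layer of activations
out = dropout(a, p=0.5)
print("fraction dropped:", (out == 0).mean())  # close to 0.5
print("mean activation :", out.mean())         # close to 1.0 in expectation
```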

There is no general rule for how to set the network architecture (depth and width of layers). In general, the optimization gets harder with the depth of the network. Network parameters can be tuned, but beware of overfitting (i.e. implement an outer cross-validation).

So, what have we gained? In this case, we have applied the methods to a very simple example only, so benefits are limited. In general, however, DNNs are particularly useful where we have large datasets, and complex dependencies that cannot be fit with simpler, traditional statistical models.

The disadvantage is that we end up with a “black box” model that can predict, but is hard to interpret for inference. This has often been named as one of the main problems of machine learning, and there is much research on new frameworks to address this issue (e.g. DALEX, lime; see also Staniak, M., & Biecek, P. (2018)).

By Betteridge’s law, the answer to this question is of course no. Or better: we don’t know. But let’s back up a bit:

Almost a year ago, LaManna and coauthors published a paper in *Science* (1), claiming that conspecific negative density dependence (CNDD) in forests, defined as the effect of local conspecific adult density on the recruit-to-adult ratio in 10x10m and 20x20m quadrats, increases toward the tropics and for rare species.

The strength and clarity of the identified effects was astonishing (at least to us), as were the implicated consequences: both in the original *Science* paper and in their press releases (i, ii), the authors interpret their results as suggesting that CNDD controls species abundance and diversity distributions, thus explaining causally why some species are rare and some are common, and why there is a latitudinal diversity gradient. They repeat these statements on YouTube:

In a Technical Comment, published today in *Science* (2), we suggest an alternative, albeit somewhat less glamorous, explanation for the results: the statistical CNDD estimators used in LaManna et al. were severely biased, and the strength of the bias depended on species abundance and on several other process and community characteristics that potentially correlate with latitude (Fig. 1; more details in our comment, see also our code on GitHub here). Because of this dependence, all the patterns reported in the original publication can emerge even when no CNDD is present whatsoever. We conclude that the methods used in LaManna et al. cannot even reliably detect the mere presence of CNDD, let alone any of the reported differences in CNDD with latitude or species abundance.
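For intuition about how such biases can arise (this is deliberately *not* LaManna et al.'s exact model, just a generic sketch of the problem class), consider regressing a recruit-to-adult ratio on adult density: because adult density appears in the denominator, a spurious negative “CNDD” slope emerges even when recruitment is entirely independent of adult density:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# a "null" community: recruit counts are completely independent of adults
adults = rng.poisson(5, size=n)
recruits = rng.poisson(5, size=n)

# naive CNDD-style estimate: slope of the log recruit-to-adult ratio
# against adult density (+1 offsets avoid log(0))
ratio = np.log((recruits + 1) / (adults + 1))
X = np.column_stack([np.ones(n), adults])
intercept, slope = np.linalg.lstsq(X, ratio, rcond=None)[0]
print(f"estimated 'CNDD' slope: {slope:.3f}")  # clearly negative, yet no CNDD exists
```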

*Science* published a second technical comment by Ryan Chisholm and Tak Fung along with our comment, which reports similar results (Ryan also wrote a blog post about their study here). Moreover, we heard informally that Matteo Detto and colleagues had submitted another comment that was, however, not accepted for publication. We invited both to give a short summary of their conclusions regarding the study:

By Ryan Chisholm: In Chisholm and Fung (3), we show in more detail why the bias arises. LaManna et al. used an unusual “statistical trick”, whereby they transformed some data points but not others prior to model fitting, in order to account for the presence of quadrats with saplings but no adults. This “selective transformation” affected more data points in tropical than in temperate plots, which ultimately led to a greater bias in CNDD estimates in tropical plots and an artefactual latitudinal gradient in CNDD. A second statistical problem with the model was the lack of an intercept term, even though an intercept term was clearly suggested by the data and biologically is needed to account for immigration. After identifying the source of the bias, we performed a more appropriate statistical analysis, which does not use a “selective transformation” and includes an intercept in the model, and, on the same data, found no statistically detectable latitudinal trend in CNDD.

By Matteo Detto: I simulated a spatial neutral model where individuals reproduce and displace their offspring according to Gaussian dispersal and saplings become adults without interacting with neighbors. Both the within site pattern (the rare species bias) and the between sites pattern (the latitudinal gradient) produced by the neutral model were similar to the original patterns presented in LaManna et al., suggesting again that the patterns reported in LaManna et al. may be solely a result of a biased statistical estimator (Fig. 2).

We did not see the response by LaManna et al. [to us, to C&F] before yesterday. If we had seen it before, we would have been happy to point out a few errors and misrepresentations of our arguments, in particular

- The fact that the statistical method for estimating CNDD used in LaManna et al. is biased is a mathematically irrefutable fact (see above / our analysis). LaManna et al. still seem to have problems grasping that reality when stating, with respect to our null simulations, “Some of these simulations produce spuriously strong CNDD for rare species, leading them to **suggest** that our methods **might** be biased.” (emphasis our own). We do not know how they define bias, but in our book, a method is biased if it produces wrong estimates in reasonable situations. Everyone who doubts that this is the case is welcome to run our code – ~~unfortunately, the reverse is not true, because the code by LaManna et al. is again not made available by the authors~~ [Edit: 27.5.18 – it seems the code has now been made available here].
- The only question is how severe the bias is in the specific situation of this paper, and whether anything other than bias is responsible for the results. We agree that this question is more difficult to answer, but the arguments brought forward by LaManna et al. to defend the existence of a real signal are not convincing. For example, they state “If this [the bias] were correct, then our estimates of CNDD would be biased toward stronger effects for rare species at any latitude”, completely disregarding a whole paragraph in our comment, and even a sentence in our abstract, where we explain that a number of processes and factors (including the number of rare species) affect the bias, and that any of these processes might (and in the case of rare species certainly does) change with latitude, which explains why the bias may change with latitude.
- In everything that follows, LaManna et al. conveniently disregard all the other processes that we have shown to create bias, concentrating entirely on dispersal. In doing so, they first misrepresent how we simulated dispersal, stating “That is why analyses that assume global dispersal, as in Hülsmann and Hartig, underestimate or fail to detect CNDD when it is actually present”, before graciously admitting that we also considered non-global dispersal. This argument is doubly wrong: first, because we did not assume global dispersal, except for a single simulation where we varied the dispersal parameter from zero to global; and second, because what they state is exactly the opposite of what we found (under global dispersal, we ALWAYS find CNDD, regardless of whether it is present or not, so there is no way we could “fail to detect CNDD”).
- Going on about dispersal, LaManna et al. suggest that a different dispersal kernel would be more appropriate. We agree that their new kernel corresponds better to measured ecological dispersal kernels, but a) the dispersal kernel we used is (in terms of shape) the dispersal kernel they used in the simulations of their original Science paper, so it is surprising that they are so critical of this choice, and b) given our simulations (see also results by Matteo Detto above), we doubt that the change of the kernel significantly changes our conclusions. However, we will have to look at this in more detail.
- ~~Unfortunately, data and code for reproducing their results are again not made available by the authors, and the description of the model in the text is certainly not sufficient to reproduce their results~~ [Edit: 27.5.18 – it seems the code has now been made available here].

In conclusion, reading all comments and the responses by LaManna et al., we see no reason to revise our statements that

- The statistical methods used in this paper are severely biased, and it is certainly suspicious that the bias creates patterns in null models that look very similar to the reported results.
- We wouldn’t know how to properly correct this bias, but we found none of the arguments or simulations by the authors convincing enough to rule out the hypothesis that all of the presented patterns are caused by processes and factors other than CNDD, in combination with the context-dependent bias.

As a last point: even if the claimed correlation could be more convincingly demonstrated, we think one should be careful about claims of causality between CNDD and large-scale diversity patterns. For example, temperature could be both a cause of higher diversity (via productivity) and of a stronger importance of pathogen control (CNDD) in the tropics. In such a scenario, CNDD and diversity might appear causally linked, but the correlation would in fact be caused by another process that affects both CNDD and diversity. Therefore, while we think that local CNDD (if it exists) likely has strong effects on local community structure and abundance, in particular spatial patterns, we would be hesitant to postulate that this scales up, i.e. that local CNDD is a major factor for relative abundance at scales > 50m.

**Side note on data / code availability**

*Science* states that the journal aims at increasing the “transparency regarding the evidence on which conclusions are based”, including open data and code, but neither the code nor the data for the study were deposited at *Science* or another independent data repository. After several emails with the authors, we were able to obtain parts of the code, but not the data. The authors referred us to existing data sharing agreements with (mostly) their coauthors, which did not allow them to pass on the data and would have required us to request each single dataset from the responsible PI. In the end, we only used the BCI dataset, which was already available to us. We think journals should make stronger efforts to enforce that code and data are deposited in appropriate, permanent repositories. Even if data are not fully open, there should be a mechanism to make data available for reproducibility checks upon request, for example through appropriate data use agreements that must be confirmed prior to access.

**References**

- J. A. LaManna et al., *Science* 356, 1389–1392 (2017)
- L. Hülsmann & F. Hartig, *Science* eaar2435 (2018)
- R. A. Chisholm & T. Fung, *Science* eaar4685 (2018)


This (co-)guest post by Carsten F. Dormann & Florian Hartig summarizes a comprehensive review on model averaging for predictive inference, just published in Ecological Monographs.

Dormann, C.F., Calabrese, J.M., Guillera-Arroita, G., Matechou, E., Bahn, V., Bartoń, K., *et al.* (in press). Model averaging in ecology: a review of Bayesian, information-theoretic and tactical approaches for predictive inference. *Ecol Monogr*, doi: 10.1002/ecm.1309

When times are dire, and data are scarce, quantitative ecologists (or quantitative scientists in general) often reach into their quiver for an arrow called **model averaging.**

Model averaging refers to the practice of using several models at once for making predictions (the focus of our review), or for inferring parameters (the focus of other papers, and some recent controversy, see, e.g. Banner & Higgs, 2017). There are literally thousands of publications across the disciplines that practice “classical” model averaging, i.e. averaging a few or many models that one could also use “stand-alone”. Additionally, model averaging, as a principle, underlies many of the most commonly used machine-learning methods (e.g. as bagging of trees in random forest or of neural network predictions). We only devoted a few sentences in the appendix of the paper to this, but we think that the link between classical model averaging and machine learning is not sufficiently appreciated and could be further explored.

In ecology, averaging of statistical models is heavily dominated by the “information-theoretical” framework popularised by Burnham & Anderson (2002), while alternative methods that are used in other scientific fields are less well-known. When we set out in March 2015, in the form of a workshop, to conduct a comprehensive review of the wealth of model-averaging approaches, we anticipated this diversity, but not the road full of potholes that we encountered. Studies and information about the topic are fragmented across disciplines, many of which have developed their own ideas and terminology to approach the model-averaging problem. Moreover, the field is largely characterized by a hands-on approach, where alternative ways to average and quantify uncertainties are proposed in abundance; however, with very little “cleaning up” of what works and what doesn’t. As a consequence, what started as a small workshop developed into a multi-author, multi-year activity that culminated in a multi-facetted publication, in which the actual technical description of the various available model-averaging algorithms is only one part.

Apart from mapping the method jungle, our review explains, at least in ecology probably for the first time,

- **why and when model averaging works, and what this depends on** (see our explanation of how bias, (co)variance and uncertainty of weight estimation influence the benefits of MA);

- **how to quantify the uncertainty of model-averaged predictions**, and why there are substantial problems in achieving good uncertainty estimates.

The goal of this post is to whet your appetite, not to reproduce the entire paper. Thus, in what follows, we will only have a superficial look at the ingredients of each of these points.

The first part of our paper shows how the error of model-averaged predictions can be decomposed into the bias and error (co)variance of the contributing models, and the uncertainty of weight estimation. Some key insights are:

- If our different models err systematically, but equally on the high and the low side, then their average has less bias.
- If our models vary stochastically, but all in the same way, then there is little point in averaging them. MA becomes more useful the lower the covariance between estimates.
- If all our models are more or less equally great (or poor), we can spare ourselves the trouble of estimating weights.

Here are some titbits of explanation:

First off, the prediction uncertainty, quantified e.g. as the mean squared error (MSE), is the sum of the squared bias and the variance. Hence, we can decompose the effect of model averaging into its effect on bias and its effect on variance.
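This decomposition is easy to verify numerically. For a toy estimator with (hypothetical) bias 2 and standard deviation 3, the MSE comes out as 2² + 3² = 13:

```python
import numpy as np

rng = np.random.default_rng(7)
truth = 10.0
# many replicate predictions from a biased, noisy model: bias = 2, sd = 3
preds = truth + 2.0 + rng.normal(0, 3.0, size=1_000_000)

mse = np.mean((preds - truth) ** 2)
bias2 = np.mean(preds - truth) ** 2
var = np.var(preds)
print(f"MSE = {mse:.2f}  =  bias^2 + variance = {bias2 + var:.2f}")  # about 4 + 9 = 13
```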

The first point about systematic error is usually not so relevant for statistical models. Classical/typical/good statistical models are unbiased, i.e. their mean prediction does not deviate from the truth. For process-based models, this need not be the case. If a process is specified wrongly, the model’s predictions may be consistently too high or too low. Averaging predictions from different process models, with biases in either direction, should therefore cancel to some extent and hence reduce the bias of the averaged prediction, explaining why model averaging is popular in process-based modelling communities such as climate modelling.

The second point about variance is more relevant for statistical models. Variance refers to the fact that an ideal statistical model gets it right on average (no bias), but will still make an error in each single application (variance). For an unbiased model, predictions will have a smaller error if their variance is lower. We show that, as a consequence of error propagation, the variance of the averaged prediction depends on the variance of each contributing model, as well as the **co**-variances among these predictions. Thus, if all models made identical predictions, the covariance would cancel any benefit of averaging. If, however, model predictions are perfectly uncorrelated, we get great benefits for the variance of the averaged prediction.

Hang on!? So only if my models make very different predictions (which might be worrying to some) do I get the full benefits of model averaging? Correct!
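A quick simulation with five unbiased models of unit error variance illustrates the point: with a pairwise error correlation of 0.9 the average is barely better than a single model, while with uncorrelated errors its variance drops to 1/5 (the numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
k, n = 5, 200_000  # 5 unbiased models, many replicate predictions

def avg_variance(rho):
    # k model errors, each with variance 1 and pairwise correlation rho
    cov = np.full((k, k), rho) + (1 - rho) * np.eye(k)
    errors = rng.multivariate_normal(np.zeros(k), cov, size=n)
    # variance of the equally-weighted model average
    return errors.mean(axis=1).var()

v_high = avg_variance(0.9)  # theory: (1 + 4 * 0.9) / 5 = 0.92
v_zero = avg_variance(0.0)  # theory: 1 / 5 = 0.20
print(f"single model: 1.00   rho=0.9: {v_high:.2f}   rho=0.0: {v_zero:.2f}")
```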

And it gets more complicated.

There is another factor influencing the variance, which is the weighting of the models. If we threw all the models we can get our hands on willy-nilly into an averaging procedure, then surely we need to sort the wheat from the chaff first. It seems illogical to allow a crappy model to ruin our model average, so we need to downweight it. Or, as the advice in many papers reads: “Only average plausible models.”

Here it gets really confusing in the literature, because that is exactly what many highly successful machine-learning approaches do **not** do. For example, in bagging, a commonly used machine-learning principle, **all** models are averaged, and they are not even weighted!

The underlying issue is that, when estimating model weights, we may accrue substantial uncertainty, and this uncertainty also propagates into our model-averaged prediction (Claeskens et al. 2016)! Indeed, it may often be wiser not to compute model weights if we have already pre-selected our models, as is the common procedure in economics and with the IPCC Earth-system models.

After having established that model averaging can (in the right circumstances) improve predictions, let us turn to the second presumed benefit of model averaging, a better representation of uncertainty.

A commonly named reason to use model averaging is that we cannot decide which of our candidate models is the correct one, and therefore want to include them all to better represent our structural uncertainty. So then the obvious question is: how do we compute an uncertainty estimate of a model average? As ingredients we (possibly) have (a) a prediction from each model, (b) a standard error for each model’s prediction, e.g. from bootstrapping, (c) the model weights, and (d) the unknown uncertainty in the model weights. How to brew them into one 95% confidence interval of the model-averaged prediction?

Again, we shall not disclose the details as given in the paper, but this issue caused some serious head-scratching among the authors (each by herself, of course).

As a teaser: there are a few proposals for how to construct frequentist confidence intervals, but they are by-and-large problematic. Some assume perfect correlation of predictions and require “non-standard mathematics”; others assume perfect independence and work surprisingly well in our little test run. (Our personal all-time favourite, the full model, did of course best, but that is not a very helpful finding for any process modeller.)

However, it should be noted that things are not so bad if one is only interested in a predictive error (which can be obtained by cross-validation), or if one works in a Bayesian framework, where posterior predictive intervals are more natural to compute.

Finally, we come to the topic that you all must have waited for: what’s the best method to compute the weights? We gave it away already: it’s hard to say, because there are **many** proposals out there, far more proposals than there are informative method comparisons.

We divided the method-zoo into three sections: one for Bayesians, one for “IC folks”, and one for practically-oriented folks (aka machine learners & co).

The pure **Bayesian** side is theoretically simple, but difficult implementation-wise (we’re talking here about the problem of estimating marginal likelihoods of the models, e.g. by reversible-jump MCMC or some other approximations).

The **information theoretical** approaches are theoretically somewhat more dubious (because they seem to strongly head into the Bayesian direction, with model weights being something akin to model probabilities, but then verbally shun Bayesian viewpoints), but well established computationally.

The smorgasbord (this word was chosen to reflect the European dominance in the author collective) of approaches not fitting either category, which we labelled **tactical**, comprises both the sound and the obscure. In short, we summarize here all the approaches that directly target a reduction of predictive error, be it by machine-learning principles or verbal argument. Key examples here are *stacking* and *jackknife* model averaging.

Detailed explanations of each approach are given in the paper, and we also ran most methods through two case studies. We found little in our results to justify the dominance of AIC-based model averaging. And model-averaging did not necessarily outperform single models.

Model averaging has no super-powers. Claims of “combining the best from all models” are plain nonsense. As with most other statistical methods, close inspection shows that model averaging has benefits and costs, and an analyst must weigh them carefully against each other to decide which approach is most suitable for their problem.

Benefits include a possible reduction of predictive error. Costs include the fact that this does not always work, and that confidence intervals (and also p-values) are difficult to provide.

To reduce prediction error, we recommend cross-validation-based approaches, which are specifically designed to achieve this goal. Embracing model structural uncertainty is certainly a laudable ambition, but the precise mathematics are complicated, and robust methods that work out of the box are not yet available.

Banner, K.M. & Higgs, M.D. (2017) Considerations for assessing model averaging of regression coefficients. Ecological Applications, 28, 78–93.

Burnham, K.P. & Anderson, D.R. (2002) Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach, 2nd edn. Springer.

Claeskens, G., Magnus, J.R., Vasnev, A.L. & Wang, W. (2016) The forecast combination puzzle: a simple theoretical explanation. International Journal of Forecasting, 32, 754–762.

Dormann, C.F., Calabrese, J.M., Guillera-Arroita, G., Matechou, E., Bahn, V., Bartoń, K., et al. (in press) Model averaging in ecology: a review of Bayesian, information-theoretic and tactical approaches for predictive inference. Ecological Monographs, doi: 10.1002/ecm.1309.

> Technical statistical mistakes are overrated; ecologists (especially students) worry too much about them. Individually and collectively, technical statistical mistakes hardly ever appreciably slow the progress of entire subfields or sub-subfields. And fixing them rarely meaningfully accelerates progress.

continuing with:

> Don’t agree? Try this exercise: name the most important purely technical statistical mistake in ecological history. And make the case that it seriously held back scientific progress.

I would argue that nothing could be further from the truth. It’s actually no challenge at all to point out massive statistical problems that slow down progress in ecology. And not only because of this, but also simply because using inappropriate methods “is the wrong thing to do” for a scientist, I very much hope that students worry about this topic. Let me give a few examples.

Statistical errors need not be massive and obvious to have an impact on the wider field.

IF A LOT OF SMALL PEOPLE IN A LOT OF SMALL PLACES DO A LOT OF SMALL THINGS, THEY CAN CHANGE THE FACE OF THE WORLD (possibly an African proverb, but surely a graffiti on the Berlin wall)

In recent years, there has been a widespread debate throughout the sciences about the reliability / replicability of scientific results (I blogged about this a few years back here and here, but there have been many new developments since – a recent collection of papers in PNAS provides a great, albeit somewhat broader, overview).

The statistical issue I’m referring to is the impact of analysis decisions like:

- Changing the hypotheses (predictor or response variables) during the analysis, e.g. trying out various combinations of predictors and response variables to see if the results are “improved” or “interesting”. **This includes looking at the data before the analysis and deciding based on that which tests to run!**
- Making data collection dependent on results, e.g. collecting a bit more data if there seems to be an effect, or removing data points that seem “weird”
- Trying out different statistical tests and using those that produce “better” = more significant results
- etc. etc.

I think few people who are involved in teaching ecological statistics will dispute that these strategies, known as p-hacking, data-dredging, fishing and HARKing (hypothesizing after results are known), are widespread in ecology, and a large body of research shows that they tend to have a substantial impact on the rate at which false positives are produced (see, e.g., Simmons et al., or the mind-boggling Brian Wansink story).
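To see why such degrees of freedom matter, here is a back-of-envelope sketch (my own illustration, not taken from any of the cited studies): a researcher who tries 20 independent null hypotheses and reports whatever comes out significant has a very good chance of finding a “positive” result in pure noise.

```r
# Probability of at least one p < 0.05 among 20 independent null tests
1 - 0.95^20  # about 0.64, not 0.05

# The same, verified by simulation: 20 t-tests on pure noise per "study"
set.seed(42)
hits = replicate(2000, any(replicate(20, t.test(rnorm(20))$p.value) < 0.05))
mean(hits)   # again roughly 0.64
```

The 20 tests here stand in for any combination of swapped predictors, alternative responses, and alternative test procedures tried on the same data.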

Could this be solved? Of course it could – the solution is well-known. For a confirmatory analysis, you need to fix your hypothesis before the data collection and stick with it. Best with a pre-registered analysis plan. I once suggested this to a colleague from an empirical ecology group, and was told “Are you crazy? If we did this, our students would never finish their PhD – the original hypothesis hardly ever checks out” … any questions about whether there are issues in ecology?

Side note – I’m all for giving exploratory analyses more weight in science, see e.g. here, but exploratory analysis = being honest about the goal. Fishing != exploratory analysis!

The second issue I’m seeing is that there are widely accepted analysis strategies in ecology that are statistically unsound. The best example I have is the analysis chain of

- Perform AIC selection
- Present regression table of the AIC selected model

What few people realize is that, while AIC selection alone is useful, and regression tables alone are useful as well, the **combination of an AIC selection with a subsequent regression table is problematic**. Specifically, in combination, the p-values in the regression table will generally be incorrect, because they do not account for the earlier AIC selection (how could they? your R command doesn’t know you did a selection). If you don’t believe me that this is a problem, try this:
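A simulation along these lines makes the point (my own sketch of the problem; the predictors are pure noise, so every significant p-value is a false positive):

```r
# 5 pure-noise predictors regressed on a pure-noise response,
# repeated 200 times, with and without AIC-based backward selection
set.seed(1)
pvals_full = pvals_selected = numeric(0)
for (i in 1:200) {
  dat  = data.frame(y = rnorm(50), matrix(rnorm(50 * 5), ncol = 5))
  full = lm(y ~ ., data = dat)
  pvals_full = c(pvals_full, summary(full)$coefficients[-1, 4])
  sel = MASS::stepAIC(full, trace = 0)   # AIC-based backward selection
  pvals_selected = c(pvals_selected, summary(sel)$coefficients[-1, 4])
}
mean(pvals_full < 0.05)      # close to the nominal 5%
mean(pvals_selected < 0.05)  # strongly inflated after selection
```

The p-values of the full model behave as advertised; the p-values of the retained terms after selection do not, because the selection step has already filtered for apparent effects.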

The full model has correct type I error rates of approximately 5%. Here’s the result after model selection – let me remind you that none of these variables truly has an effect on the response. I am pretty certain that I could get such an analysis into an ecology journal, writing a nice discussion about the ecological sense of each of these “effects”, and why our results differ from some previous studies etc. bla bla. This is why I don’t (and neither should you) do model selection for hypothesis-driven analyses!

Finally, as a third category, let’s come to statistical methods that are fundamentally flawed in the first place. I could name a whole list of issues off the top of my head, including

- Fitting power laws by log-log linear regression on size classes, which produces biased estimates and has significantly distorted efforts to test metabolic scaling theories (see an old post here).
- Regressions on beta diversity / community indices, which are notoriously unstable / dependent on other things; as well as regressions on network indices, which have the same problems. Lots of spurious results produced in these fields over the years. Incidentally, null models are not a panacea, although they help.
- And of course, there is a long list of papers that made good old plain mistakes in the analysis, whose correction completely changes the conclusions. Lisa Hülsmann and I have a technical comment forthcoming that will be discussed in a future post, but here is an old example.

You might point out that we still have to show that all this has an impact on ecological progress. It’s a tricky task, because the question itself leaves a lot of wiggle room – what is the definition of progress in the first place, and how would you know that progress has been slowed down, as long as money comes in and papers get published?

I know it’s not 100% fair, but let me turn this question around: if it didn’t matter for the wider field if what we report as scientific facts is correct or not, why go through all the painstaking work to collect data in the first place? By the same logic, I could write:

[irony on] Young people worry far too much about data collection, instead of just inventing data. I challenge you to name the most important data fabrication in ecological history. And make the case that it seriously held back scientific progress. [irony off]

Moreover, I find it very hard to believe that producing a lot of wrong results has no adverse effect on a scientific field. In the best case, by creating noisy results, we are less effective than we could be, burning money and slowing down movement in the right direction. In the worse case, we could head in a wrong direction altogether, as may have happened recently in psychology.

But even if there was no effect on the progress of science (which I think there is), I’d argue in good old Greek tradition that **using inappropriate tools and producing wrong results is simply not the right thing to do as a scientist**. It undermines the ethics, aesthetics and professional practices of science, and regardless of whether it directly affects progress, I’m quite happy for any student that worries about using the appropriate tools!

ps: of course, one can worry about things that are not important. Using a t-test on non-normal data is often not a big issue. But to know this, you have to worry first, and then test it out!

pps: I’m not saying that stats is the only thing one has to worry about. Good theory / hypotheses are another one of course, as is clear thinking. But I think stats + experimental design is quite central to getting science right.

[edit 6.5.18] After writing this post, I became aware of the study “Wang et al. (2018) Irreproducible text‐book ‘knowledge’: The effects of color bands on zebra finch fitness”, which seems to show at least one example where a field maintained a wrong conclusion due to low power / researcher degrees of freedom / selective reporting, comparable to what’s going on in psychology.



A reblog from AK Computational Ecology, summarizing a panel discussion I participated in on Biotic interactions and jSDM at the Ecology Across Borders conference in Ghent, Dec 2017.

This guest post by Carsten F. Dormann, with inputs from Casper Kraan and the panel (see below) summarises the results from the short workshop “Biotic interactions and joint species distribution models” at the Ecology Across Borders BES/GfÖ/NEVECOL/EEF-meeting 2017 in Ghent, Belgium. The purpose of this event was to exchange thoughts and questions about joint Species Distribution Models (jSDMs) and their ecological interpretation, in particular as indicators of biotic interactions.

The workshop was organised and moderated by Carsten Dormann and Casper Kraan (who regrettably was ill and could not attend). A panel of five people using/developing jSDMs answered questions (or commented on points of view) expressed by the workshop participants (“audience”): Heidi Mod (Uni Lausanne, CH), Jörn Pagel (Uni Hohenheim, D), Melinda de Jonge (Radboud Uni, NL), Florian Hartig (Uni Regensburg, D) and Nick Golding (Uni Melbourne, AUS).

In the workshop, we implicitly…


In the R environment and beyond, a large number of packages exist that estimate posterior distributions via MCMC sampling, either for specific statistical models (e.g. MCMCglmm, INLA), or via general model specification languages (such as JAGS, STAN).

Most of these packages are not designed for sampling from an arbitrary target density provided as an R function. For good reasons, as statistical modelers will seldom have the need for such a tool – it’s hard to come up with a likelihood that cannot be specified in one of the existing packages, and if so, the problem is usually so complicated that it requires specially adapted MCMC strategies.

It’s another story, however, for people that work with process-based models (simulation models, differential equation models, …). These models are usually implemented in standalone code that cannot easily be integrated into JAGS or STAN (yes, I know STAN has an ODE solver, but generally speaking … ). What we can do, however, is to call any process model from R, calculate model predictions and then calculate a likelihood based on some distributional assumptions, e.g. as in the following pseudocode.

```r
likelihood = function(param){
  predicted = processmodel(param)   # run the process model
  # normal observation error; one parameter is the sd
  ll = sum(dnorm(observed, mean = predicted, sd = param[x], log = T))
  return(ll)
}
```

This leaves us with a function likelihood(par) or posterior(par) that we want to sample from.
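To make this concrete, here is a runnable toy version (my own stand-in, not from the package): the “process model” is just a linear function of two parameters, and the third parameter is the standard deviation of the observation error.

```r
# Toy data generated by the "true" process (slope 2, intercept 1, sd 5)
set.seed(1)
x = (1:100) / 100
observed = rnorm(100, mean = 1 + 2 * x, sd = 5)

# A trivial stand-in for an expensive process-based model
processmodel = function(param) param[1] + param[2] * x

# Log-likelihood of the observations given the parameters
likelihood = function(param){
  predicted = processmodel(param)
  sum(dnorm(observed, mean = predicted, sd = param[3], log = TRUE))
}

likelihood(c(1, 2, 5))  # log-likelihood at the true parameter values
```

In a real application, `processmodel` would wrap a call to the external simulation, but the structure of the likelihood stays the same.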

BayesianTools is a toolbox for this problem. It provides general-purpose MCMC and SMC samplers, as well as plot and diagnostic functions for Bayesian statistics, with a particular focus on calibrating complex system models. The samplers implemented in the package are optimized for problematic target functions (strong correlations, multimodal targets) that are more commonly found in process-based than in statistical models. Available samplers include various Metropolis MCMC variants (including adaptive and/or delayed rejection MH), the T-walk, two differential evolution MCMCs, two DREAM MCMCs, and a sequential Monte Carlo (SMC) particle filter.

Here an example how to run an MCMC on a likelihood function in BayesianTools

```r
library(BayesianTools)
setup = createBayesianSetup(likelihood, lower = c(-100, -100, 0),
                            upper = c(100, 100, 100))
out = runMCMC(setup, sampler = "DEzs", settings = NULL)
```

Obviously, you should read the help for details. The package supports most common summaries and diagnostics, including convergence checks, plots, and model selection indices, such as DIC, WAIC, or marginal likelihood (to calculate the Bayes factor).

```r
gelmanDiagnostics(out)
summary(out)
plot(out, start = 1000)
correlationPlot(out, start = 1000)
marginalPlot(out, start = 1000)
MAP(out)
DIC(out)
WAIC(out)  # requires special definition of the likelihood, see help
marginalLikelihood(out)
```

Below are some example outputs obtained from calibrating the VSEM model, a simple ecosystem model that is provided as a test model in the package. The code to produce these plots is available when you type ?VSEM in R.

**References**

Hartig, F, Minunno, F, Paul, S (2017) BayesianTools: General-Purpose MCMC and SMC Samplers and Tools for Bayesian Statistics. R package version 0.1.3 [CRAN] [GitHub]

Statistical fluctuations aside, it seems to me that the current situation is relatively stable. Global change / large scale journals such as GCB and GEB are still going strong, but it looks as if they are not growing as fast as in previous years. It might be too early to tell, but I’d venture a guess that the IF for these newer large-scale ecology fields will saturate over the next few years. Also, it hurts me a bit to see that the IFs of the more theory-oriented journals such as AmNat and Oikos are still not really keeping up with the rest of ecology.

| Rank ’16 | Journal | Publications | IF ’16 | 5-yr IF ’16 |
|---|---|---|---|---|
| 1 | TRENDS IN ECOLOGY & EVOLUTION | 72 | 15.27 | 18.35 |
| 2 | Annual Review of Ecology Evolution and Systematics | 22 | 10.18 | 14.57 |
| 3 | ISME Journal | 255 | 9.66 | 11.63 |
| 4 | ECOLOGY LETTERS | 146 | 9.45 | 13.33 |
| 5 | ECOLOGICAL MONOGRAPHS | 29 | 8.76 | 10.22 |
| 6 | GLOBAL CHANGE BIOLOGY | 311 | 8.5 | 9.46 |
| 7 | FRONTIERS IN ECOLOGY AND THE ENVIRONMENT | 51 | 8.04 | 10.84 |
| 8 | Molecular Ecology Resources | 121 | 7.33 | 6.54 |
| 9 | MOLECULAR ECOLOGY | 392 | 6.09 | 6.64 |
| 10 | GLOBAL ECOLOGY AND BIOGEOGRAPHY | 130 | 6.05 | 7.53 |
| 11 | JOURNAL OF ECOLOGY | 160 | 5.81 | 6.5 |
| 12 | WILDLIFE MONOGRAPHS | 3 | 5.75 | 5.22 |
| 13 | Methods in Ecology and Evolution | 160 | 5.71 | 8.63 |
| 14 | FUNCTIONAL ECOLOGY | 188 | 5.63 | 5.82 |
| 15 | JOURNAL OF APPLIED ECOLOGY | 186 | 5.3 | 5.99 |
| 16 | Advances in Ecological Research | 6 | 5.06 | 6.84 |
| 17 | PROCEEDINGS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES | 542 | 4.94 | 5.42 |
| 18 | ECOGRAPHY | 123 | 4.9 | 5.37 |
| 19 | CONSERVATION BIOLOGY | 123 | 4.84 | 5.09 |
| 20 | ECOLOGY | 325 | 4.81 | 5.77 |
| 21 | LANDSCAPE AND URBAN PLANNING | 134 | 4.56 | 5.02 |
| 22 | BULLETIN OF THE AMERICAN MUSEUM OF NATURAL HISTORY | 10 | 4.56 | 6.19 |
| 23 | JOURNAL OF ANIMAL ECOLOGY | 145 | 4.47 | 5.06 |
| 24 | DIVERSITY AND DISTRIBUTIONS | 110 | 4.39 | 5.27 |
| 25 | ECOLOGICAL APPLICATIONS | 202 | 4.31 | 4.93 |
| 26 | JOURNAL OF BIOGEOGRAPHY | 205 | 4.25 | 4.89 |
| 27 | EVOLUTION | 234 | 4.2 | 4.56 |
| 28 | ECOSYSTEMS | 100 | 4.2 | 4.78 |
| 29 | AMERICAN NATURALIST | 150 | 4.17 | 4.38 |
| 30 | AGRICULTURE ECOSYSTEMS & ENVIRONMENT | 486 | 4.1 | 4.68 |
| 31 | Ecosystem Services | 118 | 4.07 | 5.87 |
| 32 | OIKOS | 182 | 4.03 | 3.86 |
| 33 | BIOLOGICAL CONSERVATION | 328 | 4.02 | 4.55 |
| 34 | HEREDITY | 114 | 3.96 | 3.95 |
| 35 | Biogeosciences | 416 | 3.85 | 4.62 |
| 36 | Current Opinion in Insect Science | 97 | 3.66 | 3.66 |
| 37 | MICROBIAL ECOLOGY | 180 | 3.63 | 3.75 |
| 38 | LANDSCAPE ECOLOGY | 161 | 3.62 | 4.11 |
| 39 | BEHAVIORAL ECOLOGY | 216 | 3.31 | 3.24 |

1) Note that this post was published in May 2018, but backdated to June 2017 to better reflect the timing of events on my blog timeline


One potential reason for the low popularity of residual checks in Bayesian analysis may be that one has to code them by hand. I therefore wanted to point out that the DHARMa package (disclosure, I’m the author), which essentially creates the equivalent of posterior predictive simulations for a large number of (G)LM(M)s fitted with MLE, can also be used for Bayesian analysis (see also my earlier post about the package). When using the package, the only step required for Bayesian residual analysis is creating the posterior predictive simulations. The rest (calculating the Bayesian p-values, plotting, and tests) is taken care of by DHARMa.

I want to demonstrate the approach with a synthetic dataset of Beetle counts across an altitudinal gradient. Apart from the altitudinal preference of the species (in ecology called the niche), the data was created with a random intercept on year and additional zero-inflation (full code of everything I do at the end of the post).

Now, we might start in JAGS (or another Bayesian software for that matter) with a simple Poisson GLM, testing for counts ~ alt + alt^2, thus specifying a likelihood such as this one

```
for (i in 1:nobs) {
  lambda[i] <- exp(intercept + alt * altitude[i] + alt2 * altitude[i] * altitude[i])
  beetles[i] ~ dpois(lambda[i])
}
```

To create posterior predictive simulations, add (and monitor) the following chunk in the JAGS model code:

```
for (i in 1:nobs) {
  beetlesPred[i] ~ dpois(lambda[i])
}
```

The nodes beetlesPred are unconnected, so this will cause JAGS to simulate new observations, based on the current model prediction lambda (i.e. posterior predictive simulations).

We can now convert these simulations into Bayesian p-values with the createDHARMa function. What this essentially does is to measure where the observed data falls on the distribution of simulated data (see code below, and the DHARMa vignette for more explanations of what that does). The resulting residuals are scaled between 0 and 1, and should be roughly uniformly distributed (the non-asymptotic distribution under H0 (model correct) might not be entirely uniform, but in simulations so far, I have not seen a single example where you would go seriously wrong assuming it is). Plotting the calculated residuals, we get:
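The conversion step might look roughly like this (a self-contained sketch: the matrix of posterior predictive simulations would come from the monitored beetlesPred nodes in the JAGS output, but here I substitute simulated Poisson draws, and all object names are mine):

```r
library(DHARMa)

# Stand-in for the JAGS output: one row per observation,
# one column per posterior sample of beetlesPred
sims    = matrix(rpois(100 * 250, lambda = 5), nrow = 100)
beetles = rpois(100, lambda = 5)  # stand-in for the observed counts

# Convert posterior predictive simulations into scaled (0-1) residuals
res = createDHARMa(simulatedResponse = sims,
                   observedResponse = beetles,
                   fittedPredictedResponse = apply(sims, 1, median),
                   integerResponse = TRUE)
plot(res)  # the standard DHARMa overview plots
```

In a real analysis, `sims` and `beetles` would of course be the monitored simulations and the data from the fitted model, not random draws.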

As explained in the DHARMa vignette, this is what overdispersion looks like. The vignette also explains that we should now usually first investigate whether there is a model misspecification problem, e.g. by plotting residuals against predictors per group. To speed things up, however, knowing that the issue is both a missing random intercept on year and zero-inflation, I have created a model that corrects for both issues.

So, here’s the residual check with the corrected (true) model – now, things look fine.

When doing this in practice, I highly recommend not only relying on the overview plots I used here, but also checking

- Residuals against all predictors (use plotResiduals with myDHARMaObject$scaledResiduals)
- All previous plots split up across all grouping factors (e.g. plot, year)
- Spatial / temporal autocorrelation

which is all supported by DHARMa.

There is one additional subtlety, which is the question of how to create the posterior predictive simulations for multi-level models. In the example below, I create the simulations conditional on the fitted random effects and the fitted zero-inflation terms. Most textbook examples of posterior predictive simulations I have seen use this approach. There is nothing wrong with this, but one has to be aware that it doesn’t check the full model, only the final random level, i.e. the Poisson part. The default for the MLE GLMMs in DHARMa is to re-simulate all random effects. I plan to discuss the difference between the two options in more detail in an extra post, but for the moment, let me say that, as a default, I recommend re-simulating the entire model also in a Bayesian analysis. An example of these extended simulations is here.

The full code for the simple (conditional posterior predictive simulation) example is here
