Artificial neural networks, especially deep neural networks (DNNs) and (deep) convolutional neural networks ((D)CNNs), have become increasingly popular in recent years, dominating most machine learning competitions since the early 2010s (for a review of DNNs and (D)CNNs, see LeCun, Bengio, & Hinton, 2015). In ecology, there is a large number of potential applications for these methods, for example image recognition, analysis of acoustic signals, or any other type of classification task for which large datasets are available.

Fig. 1 shows the principle of a DNN – we have a number of input features (predictor variables) that are connected to one or several outputs through several hidden layers of “neurons”. The different layers are connected, so that a large value in a previous layer will create corresponding values in the next, depending on the strength of the connection. These connection strengths (weights) are learned, or trained, by adjusting them to produce a good fit on the training data.

So, how does one build these kinds of models in R? A particularly convenient way is the Keras implementation for R, available since September 2017. Keras is essentially a high-level wrapper that makes the use of other machine learning frameworks more convenient. TensorFlow, Theano, or CNTK can be used as the backend. As a result, we can create an ANN with n hidden layers in a few lines of code.

As an example, here is a deep neural network fitted to the iris data set (the data consist of three iris species classes, each with 50 samples of four descriptive features). We scale the input variables to the range (0,1) and “one hot” encode (i.e. create dummy features for) the response variable. In the output layer, we define three nodes, one for each class. We use the softmax activation function to normalize the output of each node so that the outputs sum to 1 and can be read as class probabilities. For an evaluation of the model quality, Keras will split the data into a training and a validation set. The code in Keras is as follows:
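A minimal sketch of such a model with the R keras package (the layer sizes, optimizer, and number of epochs are illustrative choices):

```r
library(keras)

# scale each of the four features to the range (0,1)
x <- as.matrix(iris[, 1:4])
x <- apply(x, 2, function(v) (v - min(v)) / (max(v) - min(v)))

# "one hot" encode the three species classes
y <- to_categorical(as.integer(iris$Species) - 1, num_classes = 3)

# a DNN with two hidden layers; softmax normalizes the three output nodes
model <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = 4) %>%
  layer_dense(units = 8, activation = "relu") %>%
  layer_dense(units = 3, activation = "softmax")

model %>% compile(
  loss      = "categorical_crossentropy",
  optimizer = "adam",
  metrics   = "accuracy"
)

# validation_split holds back part of the data to evaluate model quality
history <- model %>% fit(x, y, epochs = 100, validation_split = 0.2)
```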

A common concern with this type of network is overfitting (the error on test data deviates considerably from the training error). We want our model to achieve good generalization (low test error). There are several ways to regularize the model, such as weight penalties (e.g. L1, L2), early stopping, or weight decay.

The dropout method is one simple and efficient way to regularize our model. Dropout means that nodes and their connections are randomly dropped with probability p during training. This way, an ensemble of thinned subnetworks is trained, which is averaged for predictions (see Srivastava et al., 2014 for a detailed explanation).
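In Keras, dropout is added as its own layer between the existing layers; a sketch (the rate of 0.5 and the layer sizes are illustrative):

```r
library(keras)

# same architecture as above, with dropout after each hidden layer
model <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "relu", input_shape = 4) %>%
  layer_dropout(rate = 0.5) %>%   # drop 50% of these nodes during training
  layer_dense(units = 8, activation = "relu") %>%
  layer_dropout(rate = 0.5) %>%
  layer_dense(units = 3, activation = "softmax")
```

Dropout is only active during training; at prediction time the full network is used, which approximates averaging over the ensemble of thinned subnetworks.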

There is no overall rule for how to set the network architecture (depth and width of the layers). In general, the optimization gets harder with the depth of the network. Network parameters can be tuned, but beware of overfitting (i.e. implement an outer cross-validation).

So, what have we gained? In this case, we have applied the methods to a very simple example only, so benefits are limited. In general, however, DNNs are particularly useful where we have large datasets, and complex dependencies that cannot be fit with simpler, traditional statistical models.

The disadvantage is that we end up with a “black box model” that can predict, but is hard to interpret for inference. This topic has often been named as one of the main problems of machine learning, and there is much research on new frameworks to address this issue (e.g. DALEX, lime; see also Staniak, M., & Biecek, P. (2018)).

By Betteridge’s law, the answer to this question is of course no. Or better: we don’t know. But let’s back up a bit:

Almost a year ago, LaManna and coauthors published a paper in *Science* (1), claiming that conspecific negative density dependence (CNDD) in forests, defined as the effect of local conspecific adult density on the recruit-to-adult ratio in 10x10m and 20x20m quadrats, increases toward the tropics and for rare species.

The strength and clarity of the identified effects were astonishing (at least to us), as were the implied consequences: both in the original *Science* paper and in their press releases (i, ii), the authors interpret their results as suggesting that CNDD controls species abundance and diversity distributions, thus explaining causally why some species are rare and some are common, and why there is a latitudinal diversity gradient. They repeat these statements on YouTube:

In a Technical Comment, published today in *Science* (2), we suggest an alternative, albeit somewhat less glamorous explanation for the results: the statistical CNDD estimators used in LaManna et al. were severely biased. And the strength of the bias depended on species abundance, and several other process and community characteristics that potentially correlate with latitude (Fig. 1, more details in our comment, see also our code on GitHub here). Because of this dependence, all the patterns reported in the original publication can emerge even when no CNDD is present whatsoever. We conclude that the methods used in LaManna et al. cannot even reliably detect the mere presence of CNDD, let alone any of the reported differences in CNDD with latitude or species abundance.

*Science* published a second technical comment by Ryan Chisholm and Tak Fung along with our comment, which reports similar results (Ryan also wrote a blog post about their study here). Moreover, we heard informally that Matteo Detto and colleagues had submitted another comment that was, however, not accepted for publication. We invited both to give a short summary of their conclusions regarding the study:

By Ryan Chisholm: In Chisholm and Fung (3), we show in more detail why the bias arises. LaManna et al. used an unusual “statistical trick”, whereby they transformed some data points but not others prior to model fitting, in order to account for the presence of quadrats with saplings but no adults. This “selective transformation” affected more data points in tropical than in temperate plots, which ultimately led to a greater bias in CNDD estimates in tropical plots and an artefactual latitudinal gradient in CNDD. A second statistical problem with the model was the lack of an intercept term, even though an intercept term was clearly suggested by the data and biologically is needed to account for immigration. After identifying the source of the bias, we performed a more appropriate statistical analysis, which does not use a “selective transformation” and includes an intercept in the model, and, on the same data, found no statistically detectable latitudinal trend in CNDD.

By Matteo Detto: I simulated a spatial neutral model where individuals reproduce and displace their offspring according to Gaussian dispersal and saplings become adults without interacting with neighbors. Both the within site pattern (the rare species bias) and the between sites pattern (the latitudinal gradient) produced by the neutral model were similar to the original patterns presented in LaManna et al., suggesting again that the patterns reported in LaManna et al. may be solely a result of a biased statistical estimator (Fig. 2).

We did not see the response by LaManna et al. [to us, to C&F] before yesterday. If we had seen it before, we would have been happy to point out a few errors and misrepresentations of our arguments, in particular:

- That the statistical method for estimating CNDD used in LaManna et al. is biased is a mathematically irrefutable fact (see above / our analysis). LaManna et al. still seem to have trouble grasping that reality when stating, with respect to our null simulations, “Some of these simulations produce spuriously strong CNDD for rare species, leading them to **suggest** that our methods **might** be biased.” (emphasis our own). We do not know how they define bias, but in our book, a method is biased if it produces wrong estimates in reasonable situations. Everyone who doubts that this is the case is welcome to run our code – ~~unfortunately, the reverse is not true, because the code by LaManna et al. is again not made available by the authors~~ [Edit: 27.5.18 – it seems the code has now been made available here].
- The only question is how severe the bias is in the specific situation of this paper, and whether anything other than bias is responsible for the results. We agree that this question is more difficult to answer, but the arguments brought forward by LaManna et al. to defend the existence of a real signal are not convincing. For example, they state “If this [the bias] were correct, then our estimates of CNDD would be biased toward stronger effects for rare species at any latitude”, completely disregarding a whole paragraph in our comment and even a sentence in our abstract where we explain that a number of processes and factors (including the number of rare species) affect the bias, and that any of these processes might (and in the case of rare species certainly does) change with latitude – which explains why the bias may change with latitude.
- In everything that follows, LaManna et al. conveniently disregard any of the other processes that we have shown to create bias, concentrating entirely on dispersal. Doing so, they first misrepresent how we simulated dispersal, stating “That is why analyses that assume global dispersal, as in Hülsmann and Hartig, underestimate or fail to detect CNDD when it is actually present”, before graciously admitting that we also considered non-global dispersal. This argument is doubly wrong: first, because we did not assume global dispersal, except for a single simulation where we varied the dispersal parameter from zero to global; and second, because what they state is exactly the opposite of what we found (under global dispersal, we ALWAYS find CNDD, regardless of whether it is present or not, so there is no way we could “fail to detect CNDD”).
- Going on about dispersal, LaManna et al. suggest that a different dispersal kernel would be more appropriate. We agree that their new kernel corresponds better to measured ecological dispersal kernels, but a) the dispersal kernel we used is (in terms of shape) the dispersal kernel they used in the simulations of their original Science paper, so it is surprising that they are so critical of this choice, and b) given our simulations (see also results by Matteo Detto above), we doubt that the change of the kernel significantly changes our conclusions. However, we will have to look at this in more detail.
- ~~Unfortunately, data and code for reproducing their results is again not made available by the authors, and the description of the model in the text is certainly not sufficient to reproduce their results~~ [Edit: 27.5.18 – it seems the code has now been made available here].

In conclusion, reading all comments and the responses by LaManna et al., we see no reason to revise our statements that

- The statistical methods used in this paper are severely biased, and it is certainly suspicious that the bias creates patterns in null models that look very similar to the reported results.
- We do not know how to properly correct this bias, and we found none of the arguments or simulations of the authors convincing enough to rule out the hypothesis that all of the presented patterns are caused by processes and factors other than CNDD, in combination with the context-dependent bias.

As a last point: even if the claimed correlation could be demonstrated more convincingly, we think one should be careful about claims of causality between CNDD and large-scale diversity patterns. For example, temperature could be both a cause of higher diversity (via productivity) and of a stronger importance of pathogen control (CNDD) in the tropics. In such a scenario, CNDD and diversity might appear to be causally linked, but the correlation is in fact only caused by another process that affects both CNDD and diversity. Therefore, while we think that local CNDD (if it exists) likely has strong effects on local community structure and abundance, in particular spatial patterns, we would be hesitant to postulate that this scales up, i.e. that local CNDD is a major factor for relative abundance at scales > 50m.

**Side note on data / code availability**

*Science* states that the journal aims at increasing the “transparency regarding the evidence on which conclusions are based”, including open data and code, but neither the code nor the data for the study were deposited at *Science* or another independent data repository. After several emails with the authors, we were able to obtain parts of the code, but not the data. The authors referred us to existing data sharing agreements with (mostly) their coauthors, which did not allow them to pass on the data and would have required us to request each single dataset from the responsible PI. In the end, we only used the BCI dataset, which was already available to us. We think journals should make stronger efforts to enforce that code and data are deposited in appropriate, permanent repositories. Even if data are not fully open, there should be a mechanism to make data available for reproducibility checks upon request, for example through appropriate data use agreements that must be confirmed prior to access.

**References**

- LaManna, J. A. et al., *Science* 356, 1389–1392 (2017)
- Hülsmann, L. & Hartig, F., *Science* eaar2435 (2018)
- Chisholm, R. A. & Fung, T., *Science* eaar4685 (2018)


This (co-)guest post by Carsten F. Dormann & Florian Hartig summarizes a comprehensive review on model averaging for predictive inference, just published in Ecological Monographs.

Dormann, C.F., Calabrese, J.M., Guillera-Arroita, G., Matechou, E., Bahn, V., Bartoń, K., *et al.* (in press). Model averaging in ecology: a review of Bayesian, information-theoretic and tactical approaches for predictive inference. *Ecol Monogr*, doi: 10.1002/ecm.1309

When times are dire, and data are scarce, quantitative ecologists (or quantitative scientists in general) often reach into their quiver for an arrow called **model averaging.**

Model averaging refers to the practice of using several models at once for making predictions (the focus of our review), or for inferring parameters (the focus of other papers, and of some recent controversy, see, e.g., Banner & Higgs, 2017). There are literally thousands of publications across the disciplines that practice “classical” model averaging, i.e. averaging a few or many models that one could also use “stand-alone”. Additionally, model averaging, as a principle, underlies many of the most commonly used machine-learning methods (e.g. the bagging of trees in random forests, or the averaging of neural network predictions). We only devoted a few sentences in the appendix of the paper to this, but we think that the link between classical model averaging and machine learning is not sufficiently appreciated and could be explored further.

In ecology, averaging of statistical models is heavily dominated by the “information-theoretical” framework popularised by Burnham & Anderson (2002), while alternative methods that are used in other scientific fields are less well-known. When we set out in March 2015, in the form of a workshop, to conduct a comprehensive review of the wealth of model-averaging approaches, we anticipated this diversity, but not the road full of potholes that we encountered. Studies and information about the topic are fragmented across disciplines, many of which have developed their own ideas and terminology to approach the model-averaging problem. Moreover, the field is largely characterized by a hands-on approach: alternative ways to average and to quantify uncertainties are proposed in abundance, but with very little “cleaning up” of what works and what doesn’t. As a consequence, what started as a small workshop developed into a multi-author, multi-year activity that culminated in a multi-faceted publication, in which the actual technical description of the various available model-averaging algorithms is only one part.

Apart from mapping the method jungle, our review explains, at least in ecology probably for the first time,

**why and when model averaging works, and what this depends on** (see our explanation of how bias, (co)variance and uncertainty of weight estimation influence the benefits of MA);

**how to quantify the uncertainty of model-averaged predictions**, and why there are substantial problems in achieving good uncertainty estimates.

The goal of this post is to whet your appetite, not to reproduce the entire paper. Thus, in what follows, we will only take a superficial look at the ingredients of each of these points.

The first part of our paper shows how the error of model-averaged predictions can be decomposed into the bias and error (co)variance of the contributing models, and the uncertainty of weight estimation. Some key insights are:

- If our different models err systematically, but equally on the high and the low side, then their average has less bias.
- If our models vary stochastically, but all in the same way, then there is little point in averaging them. MA becomes more useful the lower the covariance between estimates.
- If all our models are more or less equally great (or poor), we can save ourselves the trouble of estimating weights.

Here are some titbits of explanation:

First off, the prediction uncertainty, quantified e.g. as the mean squared error (MSE), is the sum of the squared bias and the variance. Hence, we can decompose the effect of model averaging into its effect on bias and its effect on variance.
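This decomposition is easy to check numerically; a quick sketch in R (the bias of 0.5 and standard deviation of 2 are arbitrary choices):

```r
# Monte Carlo illustration: MSE = bias^2 + variance
set.seed(1)
truth <- 10
est   <- rnorm(1e5, mean = 10.5, sd = 2)  # estimator with bias 0.5, sd 2

mse        <- mean((est - truth)^2)
decomposed <- (mean(est) - truth)^2 + var(est)

# mse and decomposed agree up to Monte Carlo error (both near 0.25 + 4)
```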

The first point about systematic error is usually not so relevant for statistical models. Classical/typical/good statistical models are unbiased, i.e. their mean prediction does not deviate from the truth. For process-based models, this need not be the case. If a process is specified wrongly, the model’s predictions may be consistently too high or too low. Averaging predictions from different process models, with biases in either direction, should therefore cancel to some extent and hence reduce the bias of the averaged prediction, explaining why model averaging is popular in process-based modelling communities such as climate modelling.

The second point about variance is more relevant for statistical models. Variance refers to the fact that an ideal statistical model gets it right on average (no bias), but will still make an error in each single application (variance). For an unbiased model, predictions will have a smaller error if their variance is lower. We show that, as a consequence of error propagation, the variance of the averaged prediction depends on the variance of each contributing model, as well as the **co**-variances among these predictions. Thus, if all models made identical predictions, the covariance would cancel any benefit of averaging variances. If, however, model predictions are perfectly uncorrelated, we get great benefits for their prediction’s variance.
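A small R sketch of this error-propagation argument, comparing an equal-weight average of two uncorrelated predictions with one of two identical predictions:

```r
# For an equal-weight average of two unbiased predictions:
# Var((p1 + p2)/2) = (Var(p1) + Var(p2) + 2 * Cov(p1, p2)) / 4
set.seed(42)
e1 <- rnorm(1e5)   # prediction errors of model 1, variance ~1
e2 <- rnorm(1e5)   # model 2, uncorrelated with model 1

var((e1 + e2) / 2) # ~0.5: zero covariance, averaging halves the variance
var((e1 + e1) / 2) # ~1.0: identical predictions, no benefit from averaging
```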

Hang on!? So if my models make very different predictions (which might be worrying for some), only then I get the full benefits of model averaging? Correct!

And it gets more complicated.

There is another factor influencing the variance, which is the weighting of the models. If we threw all models we can get our hands on willy-nilly into an averaging procedure, then surely we need to sort the wheat from the chaff first. It seems illogical to allow a crappy model to ruin our model average, so we need to downweight it. Or, as the advice in many papers reads: “Only average plausible models.”

Here, it gets really confusing in the literature, because that’s exactly what many highly successful machine-learning approaches do **not** do. For example, in bagging, a commonly used machine-learning principle, **all** models are averaged, and they are not even weighted!

The underlying issue is that, when estimating model weights, we may accrue substantial uncertainty, and this uncertainty also propagates into our model-averaged prediction (Claeskens et al. 2016)! Indeed, it may often be wiser not to compute model weights if we have already pre-selected our models, as is the common procedure in economics and with the IPCC earth-system models.

After having established that model averaging can (in the right circumstances) improve predictions, let us turn to the second presumed benefit of model averaging, a better representation of uncertainty.

A commonly named reason to use model averaging is that we cannot decide which of our candidate models is the correct one, and therefore want to include them all to better represent our structural uncertainty. So then the obvious question is: how do we compute an uncertainty estimate of a model average? As ingredients we (possibly) have (a) a prediction from each model, (b) a standard error for each model’s prediction, e.g. from bootstrapping, (c) the model weights, and (d) the unknown uncertainty in the model weights. How to brew them into one 95% confidence interval of the model-averaged prediction?

Again, we shall not disclose the details as given in the paper, but this issue caused some serious head-scratching among the authors (each by herself, of course).

As a teaser: there are a few proposals of how to construct frequentist confidence intervals, but they are by-and-large problematic. Some assume perfect correlation of predictions and “non-standard mathematics”, others assume perfect independence and work surprisingly well in our little test-run. (Our personal all-time favourite, the full model, did of course best, but that is not a very helpful finding for any process modeller.)

However, it should be noted that things are not so bad if one is only interested in the predictive error (which can be obtained by cross-validation), or if one works Bayesian, as posterior predictive intervals are more natural to compute.

Finally, we come to the topic that you all must have waited for: what’s the best method to compute the weights? We gave it away already: it’s hard to say, because there are **many** proposals out there, far more than informative method comparisons.

We divided the method-zoo into three sections: one for Bayesians, one for “IC folks”, and one for practically-oriented folks (aka machine learners & co).

The pure **Bayesian** side is theoretically simple, but difficult implementation-wise (we’re talking here about the problem of estimating marginal likelihoods of the models, e.g. by reversible-jump MCMC or some other approximations).

The **information theoretical** approaches are theoretically somewhat more dubious (because they seem to strongly head into the Bayesian direction, with model weights being something akin to model probabilities, but then verbally shun Bayesian viewpoints), but well established computationally.

The smorgasbord (this word was chosen to reflect the European dominance in the author collective) of approaches not fitting either category, which we labelled **tactical**, comprised the sound and obscure. In short, we summarize here all the approaches that directly target a reduction of predictive error, be it by machine-learning principles or verbal argument. Key examples here are *stacking* and *jackknife* model averaging.

Detailed explanations of each approach are given in the paper, and we also ran most methods through two case studies. We found little in our results to justify the dominance of AIC-based model averaging. And model-averaging did not necessarily outperform single models.

Model averaging has no super-powers. Claims of “combining the best from all models” are plain nonsense. Like most other statistical methods, at close inspection, we see that model averaging has benefits and costs, and an analyst must weigh them carefully against each other to decide which approach is most suitable for their problem.

Benefits include a possible reduction of predictive error. Costs include the fact that this does not always work. And that confidence intervals (and also p-values) are difficult to provide.

To reduce prediction error, we recommend cross-validation-based approaches, which are specifically designed to achieve this goal. Embracing model structural uncertainty is certainly a laudable ambition, but the precise mathematics are complicated, and robust methods that work out of the box have not yet been worked out.

Banner, K. M. & Higgs, M. D. (2017). Considerations for assessing model averaging of regression coefficients. *Ecological Applications*, 28, 78–93.

Burnham, K. P. & Anderson, D. R. (2002). Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach. 2nd ed. Berlin: Springer.

Claeskens, G., Magnus, J. R., Vasnev, A. L. & Wang, W. (2016). The forecast combination puzzle: A simple theoretical explanation. *International Journal of Forecasting*, 32, 754–762.

Dormann, C. F., Calabrese, J. M., Guillera-Arroita, G., Matechou, E., Bahn, V., Bartoń, K., et al. (in press). Model averaging in ecology: a review of Bayesian, information-theoretic and tactical approaches for predictive inference. *Ecological Monographs*, doi: 10.1002/ecm.1309.

Technical statistical mistakes are overrated; ecologists (especially students) worry too much about them. Individually and collectively, technical statistical mistakes hardly ever appreciably slow the progress of entire subfields or sub-subfields. And fixing them rarely meaningfully accelerates progress.

continuing with

Don’t agree? Try this exercise: name the most important purely technical statistical mistake in ecological history. And make the case that it seriously held back scientific progress.

I would argue that nothing could be further from the truth. It’s actually no challenge at all to point out massive statistical problems that slow down progress in ecology. And not only because of that: simply because using inappropriate methods “is the wrong thing to do” for a scientist, I very much hope that students worry about this topic. Let me give a few examples.

Statistical errors need not be massive and obvious to have an impact on the wider field.

IF A LOT OF SMALL PEOPLE IN A LOT OF SMALL PLACES DO A LOT OF SMALL THINGS, THEY CAN CHANGE THE FACE OF THE WORLD (possibly an African proverb, but surely a graffiti on the Berlin wall)

In recent years, there has been a widespread debate throughout the sciences about the reliability / replicability of scientific results (I blogged about this a few years back here and here, but there have been many new developments since – a recent collection of papers in PNAS provides a great, although somewhat broader, overview).

The statistical issue I’m referring to is the impact of analysis decisions like

- Changing the hypotheses (predictor or response variables) during the analysis, e.g. trying out various combinations of predictors and response variables to see if the results are “improved” or what is “interesting”. **This includes looking at the data before the analysis and deciding based on that what tests to make!**
- Making data collection dependent on results, e.g. collecting a bit more data if there seems to be an effect, or removing data if it seems “weird”, or here
- Trying out different statistical tests and using those that produce “better” = more significant results
- etc. etc.

I think few people that are involved in teaching ecological statistics will dispute that these strategies, known as p-hacking, data-dredging, fishing and HARKing (hypothesizing after results are known), are widespread in ecology, and a large body of research shows that they tend to have a substantial impact on the rate at which false positives are produced (see, e.g., Simmons et al., or the mind-boggling Brian Wansink story).

Could this be solved? Of course it could – the solution is well-known. For a confirmatory analysis, you need to fix your hypothesis before the data collection and stick with it. Best with a pre-registered analysis plan. I once suggested this to a colleague from an empirical ecology group, and was told “Are you crazy? If we did this, our students would never finish their PhD – the original hypothesis hardly ever checks out” … any questions about whether there are issues in ecology?

Side note – I’m all for giving exploratory analyses more weight in science, see e.g. here, but exploratory analysis = being honest about the goal. Fishing != exploratory analysis!

The second issue I’m seeing is that there are widely accepted analysis strategies in ecology that are statistically unsound. The best example I have is the analysis chain of

- Perform AIC selection
- Present regression table of the AIC selected model

What few people realize is that, while AIC selection alone is useful, and regression tables alone are useful as well, the **combination of an AIC selection with a subsequent regression table is problematic**. Specifically, in combination, the p-values in the regression table will generally be incorrect, because they do not account for the earlier AIC selection (how should they, your R command doesn’t know you did a selection). If you don’t believe me that this is a problem, try this
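A sketch of such a null simulation (the sample size and number of predictors are arbitrary choices of mine, not the original code): simulate a response that is independent of all predictors, then compare the full model with the AIC-selected one.

```r
library(MASS)  # for stepAIC

# null data: 10 predictors, none of which affects the response
set.seed(123)
dat   <- data.frame(matrix(rnorm(300 * 10), ncol = 10))
dat$y <- rnorm(300)

full <- lm(y ~ ., data = dat)
summary(full)  # each predictor is "significant" in ~5% of such runs

sel <- stepAIC(full, trace = FALSE)
summary(sel)   # predictors retained by AIC now show inflated significance
```

Repeating this many times shows that the p-values in the post-selection regression table reject the (true) null far more often than the nominal 5%.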

The full model has correct type I error rates of approximately 5%. Here’s the result after model selection – let me remind you that none of these variables truly has an effect on the response. I am pretty certain that I could get such an analysis into an ecology journal, writing a nice discussion about the ecological sense of each of these “effects”, and why our results differ from some previous studies etc. bla bla. This is why I don’t (and neither should you) do model selection for hypothesis-driven analyses!

Finally, as a third category, let’s come to statistical methods that are fundamentally flawed in the first place. I could name a whole list of issues off the top of my head, including

- Fitting power laws by log-log linear regression on size classes, which produces biased estimates and has significantly distorted efforts to test metabolic scaling theories (see an old post here).
- Regressions on beta diversity / community indices, which are notoriously unstable / dependent on other things; as well as regressions on network indices, which have the same problems. Lots of spurious results produced in these fields over the years. Incidentally, null models are not a panacea, although they help.
- And of course, there is a long list of papers that made good old plain mistakes in the analysis, whose correction completely changes the conclusions. Lisa Hülsmann and I have a technical comment forthcoming that will be discussed in a future post, but here is an old example.

You might point out that we still have to show that all this has an impact on ecological progress. It’s a tricky task, because the question itself leaves a lot of wiggle room – what is the definition of progress in the first place, and how would you know that progress has been slowed down, as long as money comes in and papers get published?

I know it’s not 100% fair, but let me turn this question around: if it didn’t matter for the wider field if what we report as scientific facts is correct or not, why go through all the painstaking work to collect data in the first place? By the same logic, I could write:

[irony on] Young people worry far too much about data collection, instead of just inventing data. I challenge you to name the most important data fabrication in ecological history. And make the case that it seriously held back scientific progress. [irony off]

Moreover, I find it very hard to believe that there is no adverse effect of producing a lot of wrong results in any scientific field. In the best case, by creating noisy results, we’re less effective than we could be, burning money and slowing down movement in the right direction. In the worst case, we could go in a wrong direction altogether, as may have happened recently in psychology.

But even if there were no effect on the progress of science (which I think there is), I’d argue in good old Greek tradition that **using inappropriate tools and producing wrong results is simply not the right thing to do as a scientist**. It undermines the ethics, aesthetics and professional practices of science, and regardless of whether it directly affects progress, I’m quite happy for any student that worries about using the appropriate tools!

ps: of course, one can worry about things that are not important. Using a t-test on non-normal data is often not a big issue. But to know this, you have to worry first, and then test it out!

pps: I’m not saying that stats is the only thing one has to worry about. Good theory / hypotheses are another one of course, as is clear thinking. But I think stats + experimental design is quite central to getting science right.

[edit 6.5.18] After writing this post, I became aware of the study “Wang et al. (2018) Irreproducible text‐book “knowledge”: The effects of color bands on zebra finch fitness”, which seems to show at least one example where a field maintained a wrong conclusion due to low power / researcher degrees of freedom / selective reporting, comparable to what’s going on in psychology.



A reblog from AK Computational Ecology, summarizing a panel discussion I participated in on biotic interactions and jSDMs at the Ecology Across Borders conference in Ghent, December 2017.

This guest post by Carsten F. Dormann, with inputs from Casper Kraan and the panel (see below) summarises the results from the short workshop “Biotic interactions and joint species distribution models” at the Ecology Across Borders BES/GfÖ/NEVECOL/EEF-meeting 2017 in Ghent, Belgium. The purpose of this event was to exchange thoughts and questions about joint Species Distribution Models (jSDMs) and their ecological interpretation, in particular as indicators of biotic interactions.

The workshop was organised and moderated by Carsten Dormann and Casper Kraan (who regrettably was ill and could not attend). A panel of five people using/developing jSDMs answered questions (or commented on points of view) raised by the workshop participants (“the audience”): Heidi Mod (Uni Lausanne, CH), Jörn Pagel (Uni Hohenheim, D), Melinda de Jonge (Radboud Uni, NL), Florian Hartig (Uni Regensburg, D) and Nick Golding (Uni Melbourne, AUS).

In the workshop, we implicitly…


In the R environment and beyond, a large number of packages exist that estimate posterior distributions via MCMC sampling, either for specific statistical models (e.g. MCMCglmm, INLA), or via general model specification languages (such as JAGS or STAN).

Most of these packages are not designed for sampling from an arbitrary target density provided as an R function. For good reason: statistical modellers will seldom need such a tool – it’s hard to come up with a likelihood that cannot be specified in one of the existing packages, and if one can, the problem is usually so complicated that it requires specially adapted MCMC strategies.

It’s another story, however, for people who work with process-based models (simulation models, differential equation models, …). These models are usually implemented in standalone code that cannot easily be integrated into JAGS or STAN (yes, I know STAN has an ODE solver, but generally speaking …). What we can do, however, is call any process model from R, calculate model predictions and then calculate a likelihood based on some distributional assumptions, e.g. as in the following pseudocode.

```r
likelihood = function(param){
  predicted = processmodel(param)
  ll = sum(dnorm(observed, mean = predicted, sd = param[x], log = TRUE))
  return(ll)
}
```

This leaves us with a function likelihood(par) or posterior(par) that we want to sample from.
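To see why this is all one needs, here is a minimal (and deliberately naive) random-walk Metropolis sampler, written in Python purely for illustration — the target is a standard normal standing in for a user-supplied `likelihood(par)`, and all names are mine. Real applications should of course use robust, adaptive samplers rather than this sketch:

```python
import numpy as np

def log_target(theta):
    # stand-in for a user-supplied log-likelihood / log-posterior:
    # here simply a standard normal density (up to a constant)
    return -0.5 * theta ** 2

def metropolis(log_p, start, n_iter=20000, proposal_sd=1.0, seed=42):
    """Random-walk Metropolis: propose, accept with prob min(1, p'/p)."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_iter)
    current, lp_current = start, log_p(start)
    for i in range(n_iter):
        proposal = current + rng.normal(0.0, proposal_sd)
        lp_proposal = log_p(proposal)
        if np.log(rng.uniform()) < lp_proposal - lp_current:
            current, lp_current = proposal, lp_proposal
        chain[i] = current
    return chain

chain = metropolis(log_target, start=0.0)
```

For a real process model, `log_target` would call the simulation and compare its output to the data, exactly as in the pseudocode above.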

BayesianTools is a toolbox for this problem. It provides general-purpose MCMC and SMC samplers, as well as plot and diagnostic functions for Bayesian statistics, with a particular focus on calibrating complex system models. The samplers implemented in the package are optimized for problematic target functions (strong correlations, multimodal targets) that are more commonly found in process-based than in statistical models. Available samplers include various Metropolis MCMC variants (including adaptive and/or delayed rejection MH), the T-walk, two differential evolution MCMCs, two DREAM MCMCs, and a sequential Monte Carlo (SMC) particle filter.

Here is an example of how to run an MCMC on a likelihood function in BayesianTools:

```r
library(BayesianTools)
setup = createBayesianSetup(likelihood, lower = c(-100, -100, 0), upper = c(100, 100, 100))
out = runMCMC(setup, sampler = "DEzs", settings = NULL)
```

Obviously, you should read the help for details. The package supports most common summaries and diagnostics, including convergence checks, plots, and model selection indices, such as DIC, WAIC, or marginal likelihood (to calculate the Bayes factor).

```r
gelmanDiagnostics(out)             # convergence check
summary(out)
plot(out, start = 1000)            # trace / density plots, discarding burn-in
correlationPlot(out, start = 1000)
marginalPlot(out, start = 1000)
MAP(out)                           # maximum a posteriori estimate
DIC(out)
WAIC(out)                          # requires a special definition of the likelihood, see help
marginalLikelihood(out)            # for calculating Bayes factors
```

Below are some example outputs obtained from calibrating the VSEM model, a simple ecosystem model that is provided as a test model in the package. The code to produce these plots is available when you type ?VSEM in R.

**References**

Hartig, F, Minunno, F, Paul, S (2017) BayesianTools: General-Purpose MCMC and SMC Samplers and Tools for Bayesian Statistics. R package version 0.1.3 [CRAN] [GitHub]

Statistical fluctuations aside, it seems to me that the current situation is relatively stable. Global change / large-scale journals such as GCB and GEB are still going strong, but it looks as if they are not growing as fast as in previous years. It might be too early to tell, but I’d venture a guess that the IF for these newer large-scale ecology fields will saturate over the next years. Also, it hurts me a bit to see that the IFs of the more theory-oriented journals such as AmNat and Oikos are still not really keeping up with the rest of ecology.

| Rank ’16 | Journal | Publications | IF ’16 | 5-yr IF ’16 |
|---|---|---|---|---|
| 1 | TRENDS IN ECOLOGY & EVOLUTION | 72 | 15.27 | 18.35 |
| 2 | Annual Review of Ecology Evolution and Systematics | 22 | 10.18 | 14.57 |
| 3 | ISME Journal | 255 | 9.66 | 11.63 |
| 4 | ECOLOGY LETTERS | 146 | 9.45 | 13.33 |
| 5 | ECOLOGICAL MONOGRAPHS | 29 | 8.76 | 10.22 |
| 6 | GLOBAL CHANGE BIOLOGY | 311 | 8.5 | 9.46 |
| 7 | FRONTIERS IN ECOLOGY AND THE ENVIRONMENT | 51 | 8.04 | 10.84 |
| 8 | Molecular Ecology Resources | 121 | 7.33 | 6.54 |
| 9 | MOLECULAR ECOLOGY | 392 | 6.09 | 6.64 |
| 10 | GLOBAL ECOLOGY AND BIOGEOGRAPHY | 130 | 6.05 | 7.53 |
| 11 | JOURNAL OF ECOLOGY | 160 | 5.81 | 6.5 |
| 12 | WILDLIFE MONOGRAPHS | 3 | 5.75 | 5.22 |
| 13 | Methods in Ecology and Evolution | 160 | 5.71 | 8.63 |
| 14 | FUNCTIONAL ECOLOGY | 188 | 5.63 | 5.82 |
| 15 | JOURNAL OF APPLIED ECOLOGY | 186 | 5.3 | 5.99 |
| 16 | Advances in Ecological Research | 6 | 5.06 | 6.84 |
| 17 | PROCEEDINGS OF THE ROYAL SOCIETY B-BIOLOGICAL SCIENCES | 542 | 4.94 | 5.42 |
| 18 | ECOGRAPHY | 123 | 4.9 | 5.37 |
| 19 | CONSERVATION BIOLOGY | 123 | 4.84 | 5.09 |
| 20 | ECOLOGY | 325 | 4.81 | 5.77 |
| 21 | LANDSCAPE AND URBAN PLANNING | 134 | 4.56 | 5.02 |
| 22 | BULLETIN OF THE AMERICAN MUSEUM OF NATURAL HISTORY | 10 | 4.56 | 6.19 |
| 23 | JOURNAL OF ANIMAL ECOLOGY | 145 | 4.47 | 5.06 |
| 24 | DIVERSITY AND DISTRIBUTIONS | 110 | 4.39 | 5.27 |
| 25 | ECOLOGICAL APPLICATIONS | 202 | 4.31 | 4.93 |
| 26 | JOURNAL OF BIOGEOGRAPHY | 205 | 4.25 | 4.89 |
| 27 | EVOLUTION | 234 | 4.2 | 4.56 |
| 28 | ECOSYSTEMS | 100 | 4.2 | 4.78 |
| 29 | AMERICAN NATURALIST | 150 | 4.17 | 4.38 |
| 30 | AGRICULTURE ECOSYSTEMS & ENVIRONMENT | 486 | 4.1 | 4.68 |
| 31 | Ecosystem Services | 118 | 4.07 | 5.87 |
| 32 | OIKOS | 182 | 4.03 | 3.86 |
| 33 | BIOLOGICAL CONSERVATION | 328 | 4.02 | 4.55 |
| 34 | HEREDITY | 114 | 3.96 | 3.95 |
| 35 | Biogeosciences | 416 | 3.85 | 4.62 |
| 36 | Current Opinion in Insect Science | 97 | 3.66 | 3.66 |
| 37 | MICROBIAL ECOLOGY | 180 | 3.63 | 3.75 |
| 38 | LANDSCAPE ECOLOGY | 161 | 3.62 | 4.11 |
| 39 | BEHAVIORAL ECOLOGY | 216 | 3.31 | 3.24 |

1) Note that this post was published in May 2018, but backdated to June 2017 to better reflect the timing of events on my blog timeline


One potential reason for the low popularity of residual checks in Bayesian analysis may be that one has to code them by hand. I therefore wanted to point out that the DHARMa package (disclosure, I’m the author), which essentially creates the equivalent of posterior predictive simulations for a large number of (G)LM(M)s fitted with MLE, can also be used for Bayesian analysis (see also my earlier post about the package). When using the package, the only step required for Bayesian residual analysis is creating the posterior predictive simulations. The rest (calculating the Bayesian p-values, plotting, and tests) is taken care of by DHARMa.

I want to demonstrate the approach with a synthetic dataset of beetle counts across an altitudinal gradient. Apart from the altitudinal preference of the species (in ecology called the niche), the data were created with a random intercept on year and additional zero-inflation (the full code for everything I do is at the end of the post).

Now, we might start in JAGS (or any other Bayesian software for that matter) with a simple Poisson GLM, testing for counts ~ alt + alt^2, thus specifying a likelihood such as this one:

```
for (i in 1:nobs) {
  lambda[i] <- exp(intercept + alt * altitude[i] + alt2 * altitude[i] * altitude[i])
  beetles[i] ~ dpois(lambda[i])
}
```

To create posterior predictive simulations, add the following chunk to the JAGS model code (and monitor the beetlesPred node):

```
for (i in 1:nobs) {
  beetlesPred[i] ~ dpois(lambda[i])
}
```

The nodes beetlesPred are unconnected, so this will cause JAGS to simulate new observations, based on the current model prediction lambda (i.e. posterior predictive simulations).
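Conceptually, turning such simulations into residuals is just an empirical-CDF lookup. Here is a sketch in Python, purely for illustration (the function and variable names are mine, not DHARMa’s): for each observation, find its position among the corresponding simulations, randomizing within ties so that discrete count data still yield continuous residuals that are uniform under a correct model:

```python
import numpy as np

def scaled_residuals(observed, simulated, seed=1):
    """Position of each observation within its simulated distribution.

    simulated has shape (n_sim, n_obs); returns values in (0, 1) that are
    approximately uniform if the model that produced 'simulated' is correct.
    """
    rng = np.random.default_rng(seed)
    n_sim = simulated.shape[0]
    res = np.empty(len(observed))
    for i, obs in enumerate(observed):
        sims = simulated[:, i]
        below = np.sum(sims < obs)
        ties = np.sum(sims == obs)
        # randomize within ties - crucial for integer-valued data
        res[i] = (below + rng.uniform() * (ties + 1)) / (n_sim + 1)
    return res

# correct model: observations and simulations from the same Poisson
rng = np.random.default_rng(2)
observed = rng.poisson(5, size=200)
simulated = rng.poisson(5, size=(250, 200))
res = scaled_residuals(observed, simulated)
```

If the model that generated the simulations is misspecified (e.g. overdispersed data fitted with a plain Poisson), these values pile up near 0 and 1 instead of being uniform — which is exactly what the DHARMa plots visualize.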

We can now convert these simulations into Bayesian p-values with the createDHARMa function. What this essentially does is measure where the observed data fall on the distribution of simulated data (see code below, and the DHARMa vignette for more explanation). The resulting residuals are scaled between 0 and 1, and should be roughly uniformly distributed (the non-asymptotic distribution under H0: model correct may not be entirely uniform, but in my simulations so far, I have not seen a single example where you would go seriously wrong assuming it is). Plotting the calculated residuals, we get:

As explained in the DHARMa vignette, this is what overdispersion looks like. Usually, we should now first investigate whether there is a model misspecification problem, e.g. by plotting residuals against predictors per group. To speed things up, however, knowing that the issue is both the missing random intercept on year and the zero-inflation, I have created a model that corrects for both issues.

So, here’s the residual check with the corrected (true) model – now things look fine.

When doing this in practice, I highly recommend not only relying on the overview plots I used here, but also checking

- Residuals against all predictors (use plotResiduals with myDHARMaObject$scaledResiduals)
- All previous plots split up across all grouping factors (e.g. plot, year)
- Spatial / temporal autocorrelation

which is all supported by DHARMa.

There is one additional subtlety, which is the question of how to create the posterior predictive simulations for multi-level models. In the example below, I create the simulations conditional on the fitted random effects and the fitted zero-inflation terms. Most textbook examples of posterior predictive simulations I have seen use this approach. There is nothing wrong with this, but one has to be aware that it doesn’t check the full model, only the final random level, i.e. the Poisson part. The default for the MLE GLMMs in DHARMa is to re-simulate all random effects. I plan to discuss the difference between the two options in more detail in a separate post, but for the moment, let me say that, as a default, I recommend re-simulating the entire model in a Bayesian analysis as well. An example of these extended simulations is here.

The full code for the simple (conditional posterior predictive simulation) example is here

We are pleased to announce our 7th international summer school on Bayesian Modelling in the ecological and environmental sciences (see past courses here). This year the course will take place at the Heathland Centre on the beautiful island of Lygra in western Norway.

Bayesian inference is an increasingly used statistical framework in ecology and the environmental sciences but it is still rarely taught within the standard curriculum. This course provides a practical introduction to Bayesian inference covering both the theory and application of Bayesian methods using a number of examples motivated from the ecological and environmental sciences. The course is taught with early-career researchers in mind (PhD students and post-doctoral researchers) but we welcome applications from all researchers at any stage in their career.

The contents of the course will include:

- Introduction to the concepts of Bayesian statistics (priors, likelihoods, etc.)
- Sampling methods (Markov Chain Monte Carlo, rejection sampling, etc.)
- Bayesian modelling and hierarchical Bayesian models
- The BUGS model language and its implementations such as JAGS and OpenBUGS
- An introduction to other Bayesian modelling software such as STAN, BayesianTools, and INLA
- Demonstration of how to fit process-based/simulation models using Bayesian methods

The course will consist of lectures, practical exercises (with R and JAGS) and talks on advanced topics in Bayesian statistics. Course language will be English. This is an introductory course, but a basic knowledge of general statistics as well as a competency in R or another programming language will be highly beneficial to profit from this course.

The total price for the 5-day course is 5800 NOK (approximately 615€ or £520). This price includes accommodation from the 24th to the 29th of September, 3 meals on each of the full teaching days (25th to 29th September) and an evening meal on the night of the 24th September. Tea, coffee, and light snacks will also be provided.

Applications to the course can be made here. Note that because we often have many applicants for the course, we try to preferentially offer places to those participants who would benefit most. The application deadline for the initial selection round is 30th June.

Over the last years, I have been using null models more often than I would have liked. I had to, when there was no other way to figure out whether an ecological pattern was unexpected, or trivial. Inspired by some recent (and also some older) posts, I thought I might throw around a few ideas that have been collecting dust in the back of my head, for what it’s worth.

Here is the summary of what I am going to say: on a philosophical level, nothing is *wrong* with null models. However, several things are *suboptimal*, to the point of being *almost as bad as wrong*. Here’s my list; below, I go through the points one by one, using examples from interaction network analyses:

- What exactly does a given null model control for? Some relevant quotes are “Null models will always be contentious.” and “Keep everything constant apart from the mechanism of interest.” (Gotelli & Graves 1996)
- Communicating null model reasoning is difficult: how is the result of a complex randomization algorithm to be interpreted ecologically?
- Coding null models can easily lead to errors: it may look like a null model, but is it an *unbiased* null model?
- Null models are only an in-between step: in the end, we really want a parametric model!

The idea of statistics, and null models in particular, is to investigate whether an observed pattern could have arisen by chance. In other words, we don’t want to be fooled into seeing patterns where there aren’t any (type I error).

But why do we need a null model in the first place? Why can we not interpret the observed pattern at face value, or use a standard statistical model? Let me try to explain with an example, loosely following Diamond’s (1975) birds-on-islands story, later dissected ad infinitum (read the full story in Gotelli & Graves 1996). The observed pattern is a species-by-island matrix, with some species not co-occurring with others, and the question is whether this pattern is random. Diamond calls it a checkerboard if, across two islands, two bird species each occur only where the other does not (leading to a 01/10 pattern), and interprets this as a signal of competition. Let’s say Diamond finds 22 checkerboard patterns for 10 birds on 20 islands. Is that a lot (indicating strong imprints of competition on the co-occurrence pattern), or to be expected given the bird abundances on these islands even without competitive effects?

The classical way to answer this question would be parametric statistics. In parametric statistics, we assume our data to be the result of a data-generating model, which consists of some systematic dependencies combined with some random variates. If we cannot define or fit such a model, the null model comes in: the idea is to re-shuffle the data in some intelligent way to get an idea of what a random pattern would look like. Null modelling is thus a technique to get a feeling (a well-formalised feeling, that is) for what the data may look like *without* a systematic effect (usually the hypothesised mechanism at work). The obvious place to look up definitions and examples is Gotelli & Graves (1996).

The key idea for our island bird problem (and many others) is that we must control for the fact that some species are rare and others are common. If a species is ubiquitous, it will have no “checkers”, and if it occurs on only one island, it can have maximally one “checker” (although with several species). Ecologically, prevalence may be related to generalism in feeding and nesting requirements, and/or dispersal abilities. The potential for checkerboards peaks at a prevalence of half of the islands. So, one reason for a null model is to control for the different prevalences of the species. (Gotelli et al. 2010 put predictors as weights into a null model; still, this paper fails to convince me.)

The same reasoning applies to the islands: some host many species, others only a few. Again, those with half the total number of species will have the highest potential for checkerboards, while islands hosting all or no species will have none. (Ecologically, this may suggest that some islands have a higher diversity of habitats, typically because they are larger.) So, another reason for a null model is to control for the different habitat richness of islands.

We can implement this thinking by devising some kind of randomization mechanism that shuffles species identities around, preserving prevalences and island diversity, and from that deduce a null expectation for the checkerboard pattern, which we can compare with the observed checkerboard pattern.
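As a toy illustration of this recipe (my own sketch in Python, not code from any of the cited papers): count “checkerboard units” in a binary species-by-island matrix, and randomize the matrix with the classical swap algorithm, which flips 2×2 submatrices of the form 10/01 and therefore preserves both row totals (species prevalences) and column totals (island richness):

```python
import numpy as np

def checkerboard_units(M):
    """For each species pair: (#islands with i only) * (#islands with j only)."""
    total = 0
    for i in range(M.shape[0]):
        for j in range(i + 1, M.shape[0]):
            only_i = np.sum((M[i] == 1) & (M[j] == 0))
            only_j = np.sum((M[j] == 1) & (M[i] == 0))
            total += only_i * only_j
    return total

def swap_randomize(M, n_swaps=1000, seed=3):
    """Swap 2x2 checkerboard submatrices; marginal totals are invariant."""
    rng = np.random.default_rng(seed)
    M = M.copy()
    done = 0
    while done < n_swaps:
        r = rng.choice(M.shape[0], size=2, replace=False)
        c = rng.choice(M.shape[1], size=2, replace=False)
        sub = M[np.ix_(r, c)]
        if sub[0, 0] == sub[1, 1] and sub[0, 1] == sub[1, 0] and sub[0, 0] != sub[0, 1]:
            M[np.ix_(r, c)] = 1 - sub  # flip 10/01 <-> 01/10
            done += 1
    return M

rng = np.random.default_rng(0)
obs = rng.integers(0, 2, size=(10, 20))  # 10 species, 20 islands
null = [checkerboard_units(swap_randomize(obs, seed=s)) for s in range(20)]
```

The observed statistic, checkerboard_units(obs), would then be compared to the null distribution to judge whether the observed pattern is unexpected.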

However, this brings us straight to point 1.

All too often, I find it difficult to understand what ecological processes a null model controls for. We may know *why* we hypothesize a specific mechanism (say competition) to be behind a pattern. However, how can we be sure that a given randomization algorithm removes all but the effects of this mechanism? Actually, this is what the dispute about Diamond’s null models boils down to, and I don’t want to take sides, but as a first shot, as said above, I would be interested in randomising in such a way that each island retains the number of species it hosts, and that each species retains the number of islands it occupies. *Why*? Well, I would like to know *how much potential for the observed pattern* there is, given general constraints set by the data. If all randomisations that preserve prevalences and diversities lead to the same (observed) pattern, then clearly no additional competitive ingredient is necessary to explain this pattern. (Note that this does not mean that species don’t interact. It “only” means that no interaction is necessary to yield this pattern.)

However, let me (and others) take issue with my own proposal: why should prevalence or habitat richness per island be considered constant? Is what we just set up really the right model to test for competition? I shall not attempt to answer this. It only serves to make my point: what do we actually control for?

My second point builds on the first: given that it is usually difficult to understand what is controlled in a null model, how do we communicate the results? The problem arises because ecologists use English, rather than mathematics, as the language of communication. As a mathematical / algorithmic rule, a null model is perfectly well-defined. After having rejected such a null model, however, we have to translate this result into language and meaning.

I think it is already a major step forward if we recognise that these *two* issues need to be communicated. The next step could be to imagine the reader to disagree with our reasoning. He/She may think that we should allow all islands to be equal, without *a priori* difference (“neutral”). That would require a different null model. Well, so be it. One cannot anticipate all possible null models, but we can disclose the data and let people do whatever they want to do with it, preferably starting in the peer-review phase. In fact, the main criticism against Diamond that I have found really convincing is that he never released his original data.

So, next time you model nulls, give reasons for the *why* and the *how*, and allow tests of robustness against critical assumptions.

The third point relates to the implementation, which is often non-trivial. I want to give an example from my own work.

For some project, we needed a null model for a matrix of pollinator visits to plants that maintains marginal totals (i.e. the total number of visits per plant and per pollinator) *and* connectance (i.e. the number of nulls in the matrix). I followed the swap algorithm used for binary matrices, but modified it so that it also works for values other than 0 and 1. Without wanting to spend too much time on the details, it works like this: choose two rows and two columns at random. Add the minimum value of the counterdiagonal to the diagonal, and subtract it from the counterdiagonal (i.e. “move” the value). This creates a 0 on the counterdiagonal. Do this until the required number of 0s (connectance) is achieved. Sounds great – but it turns out that this is biased, in the sense that it does not keep constant some crucial properties of the data. The reason is that species with high abundance are less likely to have their numbers moved to the diagonal. This violates the idea that each cell should have the same probability of being subject to a “move”. I have not been able to correct this, so we resorted to an additional null model (Vázquez 2005), which does not strictly maintain marginal totals. Luckily, the analyses we had done using the faulty null model (Dormann et al. 2009) were not qualitatively affected by the correction, but it taught me a lesson.
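To make the “move” concrete, here is a sketch of that single step in Python (my notation, for illustration only): it shifts the minimum of the counterdiagonal onto the diagonal, which leaves row and column totals untouched while tending to create a new zero — and it is exactly this step that turns out to be biased, because cells with large counts are less likely to be zeroed out.

```python
import numpy as np

def move_step(M, seed=0):
    """One 'move': pick 2 rows and 2 columns, shift the minimum of the
    counterdiagonal onto the diagonal. Preserves marginal totals."""
    rng = np.random.default_rng(seed)
    M = M.copy()
    r = rng.choice(M.shape[0], size=2, replace=False)
    c = rng.choice(M.shape[1], size=2, replace=False)
    m = min(M[r[0], c[1]], M[r[1], c[0]])  # minimum of the counterdiagonal
    M[r[0], c[1]] -= m  # each row and column loses and gains m,
    M[r[1], c[0]] -= m  # so all marginal totals stay constant
    M[r[0], c[0]] += m
    M[r[1], c[1]] += m
    return M

visits = np.array([[4, 2, 0],
                   [1, 3, 5],
                   [0, 1, 2]])  # hypothetical plant x pollinator visit counts
shuffled = move_step(visits)
```

Checking that marginals are preserved is easy; the hard part, as described above, is showing that every cell is equally likely to be hit by a move — which is precisely the property this algorithm lacks.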

The point is: null models are highly specialized, so it’s likely that they will have to be hand-coded by an ecologist. Apart from typographic errors, new code may be defective in the more subtle way described above. I am ignorant of how to prevent such errors; the main reason is that we usually have no expectation of what a correct null model should deliver (otherwise we would not need it in the first place), so how can we rigorously test whether a given piece of code works correctly?

In an airport conversation with Bob O’Hara a few years ago, I was taken aback by his blunt statement that he doesn’t like null models. At that time, I was much in love with null models! Now, after a demonstration paper by Konstans Wells and Bob (2012), I finally understand his point – and agree. Null models are somewhat clumsy tools until we figure out a way to actually specify a parametric model.

In the Diamond story, we are really interested in whether specialists outcompete generalists, which compensate by being better dispersers. So what we really want is an ecological model representing these processes; we then want to fit it to the data, and while doing so correct for the effect of habitat diversity on differently sized islands, for the traits of the species related to dispersal, and so on. This dream model represents the actual ecology we’re interested in. Our data are likely to be too few, too noisy, too unspecific to fit such a model, but doesn’t that imply that *no* null model will be able to address our question either? And if there are enough data to inform the parameters of our dream model, doesn’t a highly constrained reshuffling of the data in a null model seem an unnecessarily circuitous way to the result?

Apart from the data problem, fitting complex stochastic models is also technically challenging. This point connects to another topic that has been discussed on this blog: Approximate Bayesian Computation (ABC). Fitting mechanistic models to (sets of) data can be a tedious little nightmare. But to my mind it is a much clearer and, in the long run, much less contentious approach than null modelling.

Null models are here to stay for the immediate future, whether *I *like it or not. While this is the case, I guess the minimum standard would be to a) communicate the *aim* of the null model(s); b) communicate the idea of the *algorithm* of the null model; and c) provide *data and code* of the analysis. All this does not ensure we’ll be doing it correctly, but at least we err reproducibly.

Diamond, J.M. (1975) Assembly of species communities. *Ecology and Evolution of Communities* (eds M. Cody & J.M. Diamond), pp. 342–444. Belknap Press, Harvard, MA.

Dormann, C.F., Blüthgen, N., Fründ, J. and Gruber, B. (2009) Indices, graphs and null models: Analyzing bipartite ecological networks. *The Open Ecology Journal*, **2**, 7–24.

Gotelli, N.J. and Graves, G.R. (1996) *Null Models in Ecology*. Smithsonian Institution Press, Washington D.C. [available at the first author’s homepage for free, as book is out of print.]

Gotelli, N.J., Graves, G.R. and Rahbek, C. (2010) Macroecological signals of species interactions in the Danish avifauna. *Proceedings of the National Academy of Sciences of the USA*, **107**, 5030–5.

Vázquez, D.P. (2005) Degree distribution in plant–animal mutualistic networks: forbidden links or random interactions? *Oikos*, **108**, 421–426.

Wells, K. and O’Hara, R.B. (2014) Species interactions: estimating per-individual interaction strength and covariates before simplifying data into per-species ecological networks. *Methods in Ecology and Evolution*, **4**, 1–8. doi: 10.1111/j.2041-210x.2012.00249.x
