There is some confusion about the treatment of spatial autocorrelation in statistical models. It is highlighted, somewhat involuntarily but in an interesting way, by a recent article by Bradford Hawkins in the Journal of Biogeography and a reply by Ingolf Kühn and Carsten Dormann that just came out in the May issue of the same journal. As a disclaimer, I’m firmly siding with Ingolf and Carsten on this one, but I’ll do my best to provide a fair account of the problem.
So, what’s the deal? First of all, we should define what we mean by spatial autocorrelation (SA)¹. Spatial autocorrelation in a spatial dataset refers to the phenomenon that the variation between the values of datapoints is affected by their spatial distance.
Spatial autocorrelation occurs frequently in ecological data, the underlying reason being that many drivers of ecological patterns such as climate or soil act at large spatial scales, making spatially close datapoints more similar than distant ones. Thus, SA arises in a perfectly natural way from environmental or ecological processes that act above the sampling scale.
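To make the definition concrete, here is a minimal sketch in Python (NumPy only) that computes Moran's I, one common index of SA, for a simulated response driven by a smooth large-scale gradient and for pure noise. Everything in it is an assumption for illustration: the `morans_i` helper, the binary neighbourhood weights with a cutoff of 10 distance units, and the made-up `climate` driver.

```python
import numpy as np

def morans_i(values, coords, max_dist):
    """Moran's I with binary weights: w_ij = 1 if 0 < distance <= max_dist."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances between all sampling locations
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = ((dist > 0) & (dist <= max_dist)).astype(float)
    z = values - values.mean()
    return len(values) / w.sum() * (z @ w @ z) / (z @ z)

rng = np.random.default_rng(1)
n = 300
coords = rng.uniform(0, 100, size=(n, 2))   # random sites in a 100 x 100 region

# Hypothetical driver acting above the sampling scale: a smooth gradient plus a wave
climate = 0.05 * coords[:, 0] + np.sin(coords[:, 1] / 15.0)
response = 2.0 * climate + rng.normal(0, 0.5, n)   # response inherits the spatial pattern
noise = rng.normal(0, 1, n)                        # no spatial structure at all

i_response = morans_i(response, coords, max_dist=10)
i_noise = morans_i(noise, coords, max_dist=10)
print(f"Moran's I, climate-driven response: {i_response:.2f}")
print(f"Moran's I, pure noise:              {i_noise:.2f}")
```

The climate-driven response should give a clearly positive Moran's I, while the noise variable should stay close to zero. Other weighting schemes (e.g. inverse distance) are equally defensible; the binary cutoff is just the simplest choice.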
On the other hand, there is overwhelming agreement among statisticians that spatial autocorrelation in statistical models is a problem that needs to be corrected by appropriate methods. And this creates the confusion, because some people, among them Hawkins, seem to have a problem with the notion of “bad” spatial autocorrelation. Hawkins, for example, states:
If spatial autocorrelation is part of nature, and we are trying to understand nature, it makes little sense to claim that spatial autocorrelation in data represents some sort of bias, artefact or distortion.
Hawkins never clearly defines which data he is speaking about, but the way he refers to “spatial autocorrelation in the data” suggests that he is thinking about spatial autocorrelation in the raw data, i.e. in the response variable that we try to explain with a statistical model. And here’s the thing: this is not the autocorrelation statisticians are concerned with. The point of concern for statisticians is what is called residual spatial autocorrelation (RSA), which is the autocorrelation of the residuals between model predictions and data. Kühn and Dormann:
We believe that within the arguments presented by Hawkins (2012), he has confounded the occurrence of SA in the raw data with SA in the residuals. If the spatial autocorrelation of an ecological response variable is caused by autocorrelated predictor variables (such as climate, land use, topography, human population densities or virtually any other spatial predictor), we are not alarmed. Of course we do not wish to remove this effect of such predictors. […] SA in the residuals is, however, a serious problem, because it
(1) indicates the violation of an independence assumption of any statistical model, be it regression or CART (classification and regression trees), resulting in incorrect error probabilities; and
(2) can seriously affect coefficient estimates.
To add some remarks to that: one potential source for RSA is the part of SA in the raw data that is not explained by the model. However, this is not the only source. If environmental predictors show SA, model predictions will too, and this can cause RSA even if there is no SA in the raw data. Thus, RSA might reflect unexplained natural processes, but it may also simply reflect model error.
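As a rough illustration of how model error alone produces RSA, the following Python sketch simulates a response that depends on two spatially structured predictors, fits three ordinary least squares models, and computes Moran's I of the residuals for each. The predictors `temperature` and `soil`, the coefficients, and the helpers `morans_i` and `ols_residuals` are all hypothetical choices made for this example, not anything taken from the papers discussed.

```python
import numpy as np

def morans_i(values, coords, max_dist):
    """Moran's I with binary weights: w_ij = 1 if 0 < distance <= max_dist."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = ((dist > 0) & (dist <= max_dist)).astype(float)
    z = values - values.mean()
    return len(values) / w.sum() * (z @ w @ z) / (z @ z)

def ols_residuals(predictors, y):
    """Residuals of an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(7)
n = 300
coords = rng.uniform(0, 100, size=(n, 2))

# Two hypothetical, spatially smooth predictors
temperature = np.sin(coords[:, 0] / 15.0)
soil = np.cos(coords[:, 1] / 20.0)

# "True" process: hump-shaped in temperature, linear in soil, independent noise
y = 1.5 * temperature - 2.0 * temperature**2 + 1.0 * soil + rng.normal(0, 0.3, n)

models = {
    "soil omitted": np.column_stack([temperature, temperature**2]),
    "temperature linear only": np.column_stack([temperature, soil]),
    "full model": np.column_stack([temperature, temperature**2, soil]),
}
rsa = {name: morans_i(ols_residuals(X, y), coords, max_dist=10)
       for name, X in models.items()}
for name, value in rsa.items():
    print(f"{name:>24s}: residual Moran's I = {value:.2f}")
```

In this setup the raw response is autocorrelated in all three cases, but strong RSA should appear only when a spatial predictor is left out or its functional form is wrong. With the correctly specified model the residuals should show essentially no spatial pattern.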
My conclusion on this exchange:
- The ecological literature has been far too sloppy when distinguishing between SA in the data and RSA. In fact, most studies, when referring to RSA, simply use the word SA. Speaking strictly about RSA rather than SA will avoid confusion.
- RSA is by definition a violation of statistical assumptions. This does not mean that inferential results MUST be biased, but they CAN be. Therefore, something MUST be done if strong RSA is detected.
- About WHAT must be done, the statistical literature may have been a bit fast in promoting phenomenological add-ons to regression models that basically “absorb” RSA. Maybe we should more often invest some time in understanding the ecological reason for RSA and remove it by modifying the model, e.g. by including additional predictor variables or choosing a different functional form. Reinterpreted like that, there is some merit in the critique of the current treatment of RSA. However, RSA that cannot be removed by improving the model must still be corrected, because of point (2) in the quote above.
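To see why strong RSA cannot simply be ignored, here is a small simulation sketch in Python of the “incorrect error probabilities” from point (1) of the quote. It regresses a response on a predictor that has no effect at all and counts how often ordinary least squares declares the slope significant at the 5% level. The exponential covariance, the correlation range of 3 distance units, the sample size, and the function name `false_positive_rate` are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n, n_sim = 100, 500
coords = rng.uniform(0, 10, size=(n, 2))
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# Cholesky factor of an exponential spatial covariance (assumed range: 3 units);
# the tiny diagonal term only guards against numerical problems
chol = np.linalg.cholesky(np.exp(-dist / 3.0) + 1e-8 * np.eye(n))

T_CRIT = 1.984  # two-sided 5% critical value of the t distribution, 98 df

def false_positive_rate(spatial_errors):
    """Share of simulations in which a truly absent effect has p < 0.05."""
    hits = 0
    for _ in range(n_sim):
        x = chol @ rng.normal(size=n)      # spatially autocorrelated predictor
        y = rng.normal(size=n)             # response is pure error: true slope = 0
        if spatial_errors:
            y = chol @ y                   # make the errors spatially autocorrelated
        X = np.column_stack([np.ones(n), x])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        sigma2 = resid @ resid / (n - 2)
        se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
        hits += abs(beta[1] / se_slope) > T_CRIT
    return hits / n_sim

rate_independent = false_positive_rate(spatial_errors=False)
rate_spatial = false_positive_rate(spatial_errors=True)
print(f"False positive rate, independent errors:    {rate_independent:.2f}")
print(f"False positive rate, autocorrelated errors: {rate_spatial:.2f}")
```

Note that the predictor is spatially autocorrelated in both scenarios. With independent errors the false positive rate should stay near the nominal 5%, so SA in the predictor by itself is harmless. It is the autocorrelation in the errors, i.e. RSA, that should push the rate well above 5%.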
Hawkins, B. A. (2012) Eight (and a half) deadly sins of spatial analysis. Journal of Biogeography, 39, 1-9.
Kühn, I. & Dormann, C. F. (2012) Less than eight (and a half) misconceptions of spatial analysis. Journal of Biogeography, 39, 995-998.
1) Pretty much everything said here also applies to temporal autocorrelation if “space” is replaced by “time”.
11 thoughts on “Spatial autocorrelation in statistical models – friend or foe?”
One of the most frustrating comments I get periodically is, “You can’t use those predictors that way, they’re spatially/temporally correlated. You need to detrend them first.” And yet, that’s the point. That spatial or temporal variation in a predictor is the variation needed to get a signal of said predictor. By detrending any signal from a predictor that is spatially or temporally distributed, you throw the baby out with the bathwater. Frustrating. For analyses where I know I may be missing the signal of some driver that is spatially distributed, I’ll often use Moran’s I on the residuals and correct accordingly. But detrending or otherwise correcting predictors? That way leads to removing very real biological signals and incorrect answers in my experience.
Yes, reviewers can be an annoying species 😉 Joking apart, I have no problem recognizing that methods to account for RSA come with their own problems, and I agree that detrending is a particularly problematic one.
However, I think it’s the wrong message to conclude from this that RSA is no problem and can safely be ignored. In some cases, and depending on what you want to get out of the analysis, it might be OK not to correct for RSA – there is, e.g., a certain chance that effect sizes and predictions are more or less unbiased. However, some inferential “products”, in particular p-values, (Bayesian) CIs, AIC values etc., are nearly inevitably biased by strong RSA, so if you want to report them you will have to correct for RSA in some way, or you simply have to realize that the analysis you want to do can’t be done – there are limits to what you can legitimately extract from data.
I agree – correction for RSA is key, as there is real biological meaning in RSA. It’s the implication that the autocorrelation needs to be somehow ‘corrected’ for before the analysis even begins that is problematic – i.e., detrending predictors and then only using the detrended predictors in an analysis. That’s dealing with SA, and throwing the baby out with the bathwater. After an analysis, though, a thorough examination of RSA or RTA is definitely warranted so that things like p-values can be calculated correctly.
To deal with the red herring that also pops up after the above paragraph, I often hear ‘well, what if your predictors happen to covary spatially with the true driver of a process’ – as if that is unique to analysis of field data that have a spatial component. That’s a problem with any analysis that deals with unmanipulated data, and requires careful consideration when building one’s causal model – a much larger issue that is not unique to spatial or temporal data.
The detrending topic really is a contentious one – as you say, you have to be sure that you’re not removing your signal, so I think it’s advisable not to promote this as an automatism. But I guess if you know your data and what you’re doing, it’s justifiable, as long as you acknowledge that detrending imposes additional (and potentially consequential) assumptions on which your inference is now conditioned.
You’re right, the second comment is a red herring.
Cheers for the comments, I appreciate it! btw, I realized you’re co-running the SciFund challenge. I’ve been following that for a while (without getting involved myself I have to confess), I think that’s really a great initiative for connecting researchers with the public.
Hi Florian, nice article, it helped me to better understand the differences between SA in the data and RSA. I found myself dealing with some strong SA in my data when fitting BRTs and GAMMs, and accounting for SA to remove any spatial structure in the residuals did in fact change the significance level of some variables, leading to a simpler model. Now, do you have a good reference for the paragraph where you detail potential sources of RSA? If you did, it would be fantastic, thanks!
“To add some remarks to that: one potential source for RSA is the part of SA in the raw data that is not explained by the model. However, this is not the only source. If environmental predictors show SA, model predictions will too, and this can cause RSA even if there is no SA in the raw data. Thus, RSA might reflect unexplained natural processes, but it may also simply reflect model error. “
Hi Pablo – no, not really. We had an MSc thesis running a while ago looking at what kind of RSA different factors produce, but this isn’t citable (yet). It’s common sense though that leaving out a predictor with SA, or having the wrong functional relationship for this predictor, will create RSA, unless this predictor is collinear with some other predictors and can be replaced by those. The only thing close to what you want that I’m aware of would be:
van Teeffelen, A. & Ovaskainen, O. (2007) Can the cause of aggregation be inferred from species distributions? Oikos, 116, 4-16.
Some more general arguments are in:
Dormann, C. et al. (2007) Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography, 30, 609-628.
Thank you very much for this informative article, Florian! I just have a couple questions: if I add a spatial autocorrelation term to a statistical model that also has interesting, spatially-correlated explanatory variables, does that mean I am accounting for things that are unmeasured to make sure that I don’t have spatial autocorrelation in the residuals? In other words, by adding the spatial autocorrelation term I won’t take away the effect of the interesting spatial autocorrelation of the explanatory variables, right?
As I hinted at in the post, assuming we have a perfectly fitting model, spatial autocorrelation in the predictors should not create a spatial residual pattern, so adding a spatial term shouldn’t make a difference. Moreover, because the spatial autocorrelation of the predictors is independent of the model, any interpretation of this signal will be unaffected by the model that you fit.
What you account for when you add a spatial term is RSA, and this may be caused by unmeasured spatial predictors, an inadequate functional response for a measured spatial predictor, or some other error in the model that depends on space.
That is of course only the theory. In practice, in a small data set, making the model more complex by adding a spatial term can create all kinds of overfitting problems, including that you weaken or absorb the effect of a (spatial) predictor of interest.