Spatial autocorrelation in statistical models – friend or foe?

There is some confusion about how to treat spatial autocorrelation in statistical models. It is highlighted, somewhat involuntarily but in a way I find very interesting, by a recent article by Bradford Hawkins in the Journal of Biogeography and a reply by Ingolf Kühn and Carsten Dormann, which just came out in the May issue of the same journal. As a disclaimer, I’m firmly siding with Ingolf and Carsten on this one, but I’ll do my best to give a fair account of the problem.

So, what’s the deal? First of all, we should define what we mean by spatial autocorrelation (SA; see footnote 1). Spatial autocorrelation in a spatial dataset refers to the phenomenon that how much the values of two datapoints differ depends on the spatial distance between them.
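One common way to put a number on this is Moran’s I. As a sketch of the usual textbook form: for values x_i observed at n locations and spatial weights w_ij (the choice of weights is an assumption of the analyst, e.g. 1 if two points lie within some distance of each other and 0 otherwise),

```latex
I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot
    \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i} (x_i - \bar{x})^2}
```

Values clearly above the null expectation of -1/(n-1) mean that neighbouring datapoints are more similar than you would expect if location did not matter.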

Spatial autocorrelation occurs frequently in ecological data. The underlying reason is that many drivers of ecological patterns, such as climate or soil, act at larger scales than those at which data are sampled, making close datapoints more similar than distant ones. Thus, SA is a perfectly natural thing that arises from environmental or ecological processes that act above the sampling scale. There is, however, overwhelming agreement among statisticians that spatial autocorrelation in statistical models is a problem that needs to be corrected by appropriate methods. And this creates the confusion, because some people, among them Hawkins, seem to have a problem with the notion of “bad” spatial autocorrelation. Hawkins, e.g., states that

If spatial autocorrelation is part of nature, and we are trying to understand nature, it makes little sense to claim that spatial autocorrelation in data represents some sort of bias, artefact or distortion.

Hawkins never clearly defines which data he is speaking about, but the way he refers to “spatial autocorrelation in the data” suggests that he is thinking about spatial autocorrelation in the raw data, i.e. in the response variable that we try to explain with a statistical model. And here’s the thing: this is not the autocorrelation statisticians are concerned with. The point of concern for statisticians is what is called residual spatial autocorrelation (RSA), which is the autocorrelation of the residuals, i.e. of the differences between the data and the model predictions. Kühn and Dormann:

We believe that within the arguments presented by Hawkins (2012), he has confounded the occurrence of SA in the raw data with SA in the residuals. If the spatial autocorrelation of an ecological response variable is caused by autocorrelated predictor variables (such as climate, land use, topography, human population densities or virtually any other spatial predictor), we are not alarmed. Of course we do not wish to remove this effect of such predictors. [...] SA in the residuals is, however, a serious problem, because it
(1) indicates the violation of an independence assumption of any statistical model, be it regression or CART (classification and regression trees), resulting in incorrect error probabilities; and
(2) can seriously affect coefficient estimates.

To add some remarks to that: one potential source for RSA is the part of SA in the raw data that is not explained by the model. However, this is not the only source. If environmental predictors show SA, model predictions will too, and this can cause RSA even if there is no SA in the raw data. Thus, RSA might reflect unexplained natural processes, but it may also simply reflect model error.
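To make the difference between SA in the raw data and RSA concrete, here is a minimal simulation sketch in Python (NumPy). Everything in it is made up for illustration: the predictors `climate` and `soil`, the effect sizes, the neighbourhood `bandwidth` and the helper names are my assumptions, not anything taken from the two papers. The response is driven by two spatially smooth predictors; one model includes both, the other omits `soil`.

```python
import numpy as np

def morans_i(values, coords, bandwidth=0.15):
    """Moran's I with binary weights: points closer than `bandwidth` are neighbours."""
    z = values - values.mean()
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = ((d > 0) & (d < bandwidth)).astype(float)  # no self-neighbours
    return (len(values) / w.sum()) * (z @ w @ z) / (z @ z)

def ols_residuals(y, *predictors):
    """Residuals of an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones_like(y), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(42)
n = 400
coords = rng.uniform(0, 1, size=(n, 2))      # random sampling locations

# Hypothetical, spatially smooth drivers (assumed for this sketch)
climate = coords[:, 0]                        # gradient along the first axis
soil = np.sin(2 * np.pi * coords[:, 1])       # smooth pattern along the second axis

# Response: both drivers plus spatially independent noise
y = 2.0 * climate + 1.5 * soil + rng.normal(0, 0.3, n)

i_raw = morans_i(y, coords)                                   # SA in the raw data
i_full = morans_i(ols_residuals(y, climate, soil), coords)    # RSA, both drivers in the model
i_omit = morans_i(ols_residuals(y, climate), coords)          # RSA, soil omitted

print(f"raw data:            I = {i_raw:.2f}")
print(f"residuals (full):    I = {i_full:.2f}")
print(f"residuals (no soil): I = {i_omit:.2f}")
```

In this setup the raw response is strongly autocorrelated in both cases. When both drivers are in the model, the residuals are close to spatially independent; when `soil` is left out, its spatial pattern ends up in the residuals as RSA.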

My conclusion on this exchange:

  1. The ecological literature has been far too sloppy in distinguishing between SA in the data and RSA. In fact, most studies, when referring to RSA, simply use the word SA. Speaking strictly of RSA rather than SA will avoid confusion.
  2. RSA is by definition a violation of statistical assumptions. This does not mean that inferential results MUST be biased, but they CAN be. Therefore, something MUST be done if strong RSA is detected.
  3. About WHAT must be done: the statistical literature may have been a bit quick to promote phenomenological add-ons to regression models that basically “absorb” RSA. Maybe we should more often invest some time in understanding the ecological reason for RSA and remove it by modifying the model, e.g. by including additional predictor variables or choosing a different functional form. Reinterpreted like that, there is some merit in the critique of the current treatment of RSA. However, RSA that cannot be removed by improving the model still must be corrected, because of point 2.
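The “different functional form” part of point 3 can be illustrated with a small sketch, again in Python (NumPy) and again under made-up assumptions: a hypothetical transect where the predictor `x` changes smoothly with position, a U-shaped true response, and the lag-1 correlation of neighbouring residuals as a simple stand-in for an RSA measure along the transect.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x = np.linspace(0, 1, n)   # predictor that varies smoothly along a hypothetical transect
y = 4.0 * (x - 0.5) ** 2 + rng.normal(0, 0.05, n)   # assumed U-shaped response plus noise

def residuals(y, X):
    """Residuals of an ordinary least squares fit for design matrix X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def lag1_autocorr(r):
    """Correlation between each residual and its next neighbour along the transect."""
    return np.corrcoef(r[:-1], r[1:])[0, 1]

ones = np.ones(n)
res_linear = residuals(y, np.column_stack([ones, x]))            # wrong functional form
res_quadratic = residuals(y, np.column_stack([ones, x, x**2]))   # adequate functional form

rsa_linear = lag1_autocorr(res_linear)
rsa_quadratic = lag1_autocorr(res_quadratic)

print(f"linear model:    lag-1 residual autocorrelation = {rsa_linear:.2f}")
print(f"quadratic model: lag-1 residual autocorrelation = {rsa_quadratic:.2f}")
```

Here the strong RSA of the linear model is pure model error: it largely disappears once the quadratic term is added, without any spatial add-on. Real data will rarely be this clean, so whatever RSA is left after improving the model still has to be dealt with.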

References

Hawkins, B. A. (2012) Eight (and a half) deadly sins of spatial analysis. Journal of Biogeography, 39, 1-9.

Kühn, I. & Dormann, C. F. (2012) Less than eight (and a half) misconceptions of spatial analysis. Journal of Biogeography, 39, 995-998.

1) Pretty much everything said here also applies to temporal autocorrelation if “space” is replaced by “time”.

6 comments
  1. jebyrnes said:

    One of the most frustrating comments I get periodically is, “You can’t use those predictors that way, they’re spatially/temporally correlated. You need to detrend them first.” And yet, that’s the point. That spatial or temporal variation in a predictor is the variation needed to get a signal of said predictor. By detrending any signal from a predictor that is spatially or temporally distributed, you throw the baby out with the bathwater. Frustrating. For analyses where I know I may be missing the signal of some driver that is spatially distributed, I’ll often use Moran’s I on the residuals and correct accordingly. But detrending or otherwise correcting predictors? That way leads to removing very real biological signals and incorrect answers in my experience.

    • Yes, reviewers can be an annoying species ;) Joking apart, I have no problem recognizing that methods to account for RSA come with their own problems, and I agree that detrending is a particularly problematic one.

      However, I think it’s the wrong message to conclude from this that RSA is no problem and can safely be ignored. In some cases, and depending on what you want to get out of the analysis, it might be OK not to correct for RSA: there is, e.g., a certain chance that effect sizes and predictions are more or less unbiased. However, some inferential “products”, in particular p-values, (Bayesian) CIs, AIC values etc., are nearly inevitably biased by strong RSA, so if you want to report them you will have to correct for RSA in some way, or you simply have to accept that the analysis you want to do can’t be done; there are limits to what you can legitimately extract from data.

      • jebyrnes said:

        I agree – correction for RSA is key, as there is real biological meaning in RSA. It’s the implication that the autocorrelation needs to be somehow ‘corrected’ for before the analysis even begins that is problematic – i.e., detrending predictors and then only using the detrended predictors in an analysis. That’s dealing with SA, and throwing the baby out with the bathwater. After an analysis, though, a thorough examination of RSA or RTA is definitely warranted so that things like p-values can be calculated correctly.

        To deal with the red herring that also pops up after the above paragraph: I often hear ‘well, what if your predictors happen to covary spatially with the true driver of a process’, as if that is unique to analysis of field data that have a spatial component. That’s a problem with any analysis that deals with unmanipulated data, and it requires careful consideration when building one’s causal model. It is a much larger issue that is not unique to spatial or temporal data.

        • The detrending topic really is a contentious one. As you say, you have to be sure that you’re not removing your signal, so I think it’s advisable not to promote this as an automatic step. But I guess if you know your data and what you’re doing, it’s justifiable, as long as you acknowledge that detrending imposes additional (and potentially consequential) assumptions on which your inference is now conditioned.

          You’re right, the second comment is a red herring.

          Cheers for the comments, I appreciate it! btw, I realized you’re co-running the SciFund challenge. I’ve been following that for a while (without getting involved myself I have to confess), I think that’s really a great initiative for connecting researchers with the public.
