There is some confusion about the treatment of spatial autocorrelation in statistical models. It is highlighted, somewhat involuntarily but, I find, very interestingly, in a recent article by Bradford Hawkins in the Journal of Biogeography and a reply by Ingolf Kühn and Carsten Dormann, which just came out in the May issue of the same journal. As a disclaimer, I’m strictly taking sides with Ingolf and Carsten on this one, but I’ll do my best to provide a fair account of the problem.
So, what’s the deal? First of all, we should define what we mean by spatial autocorrelation (SA) [1]. Spatial autocorrelation in a spatial dataset refers to the phenomenon that the variation between the values of datapoints depends on their spatial distance.
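To make the definition concrete: one common way to quantify SA is Moran's I. It is not named in the articles discussed here; I only use it as an illustration. The symbols are my own notation: $x_i$ is the value at location $i$, $\bar{x}$ the mean over all $n$ locations, and $w_{ij}$ a spatial weight that is large for close pairs of locations and zero for distant ones (with $w_{ii} = 0$).

```latex
I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot
    \frac{\sum_{i}\sum_{j} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i} (x_i - \bar{x})^2}
```

Without SA, the expected value of $I$ is $-1/(n-1)$, i.e. close to zero for large $n$. Clearly positive values mean that close datapoints are more similar than distant ones.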
Spatial autocorrelation occurs frequently in ecological data. The underlying reason is that many drivers of ecological patterns, such as climate or soil, act at larger scales than those at which data are sampled, making close datapoints more similar than distant ones. Thus, SA is a perfectly natural thing that arises from environmental or ecological processes that act above the sampling scale. There is, however, overwhelming agreement among statisticians that spatial autocorrelation in statistical models is a problem that needs to be corrected by appropriate methods. And this creates the confusion, because some people, among them Hawkins, seem to have a problem with the notion of “bad” spatial autocorrelation. Hawkins, for example, states that
If spatial autocorrelation is part of nature, and we are trying to understand nature, it makes little sense to claim that spatial autocorrelation in data represents some sort of bias, artefact or distortion.
Hawkins never clearly defines which data he is speaking about, but the way he refers to “spatial autocorrelation in the data” suggests that he is thinking about spatial autocorrelation in the raw data, i.e. in the response variable that we try to explain with a statistical model. And here’s the thing: this is not the autocorrelation statisticians are concerned with. The point of concern for statisticians is what is called residual spatial autocorrelation (RSA), which is the autocorrelation of the residuals, i.e. of the differences between the data and the model predictions. Kühn and Dormann:
We believe that within the arguments presented by Hawkins (2012), he has confounded the occurrence of SA in the raw data with SA in the residuals. If the spatial autocorrelation of an ecological response variable is caused by autocorrelated predictor variables (such as climate, land use, topography, human population densities or virtually any other spatial predictor), we are not alarmed. Of course we do not wish to remove this effect of such predictors. [...] SA in the residuals is, however, a serious problem, because it
(1) indicates the violation of an independence assumption of any statistical model, be it regression or CART (classification and regression trees), resulting in incorrect error probabilities; and
(2) can seriously affect coefficient estimates.
To add some remarks to that: one potential source of RSA is the part of SA in the raw data that is not explained by the model. However, this is not the only source. If environmental predictors show SA, model predictions will too, and this can cause RSA even if there is no SA in the raw data. Thus, RSA might reflect unexplained natural processes, but it may also simply reflect model error.
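The difference between SA in the raw data and RSA can be illustrated with a small simulation. This is only a sketch under assumptions of my own, not anything taken from the two articles: the predictors `climate` and `soil` are made-up smooth functions of the coordinates, SA is measured with Moran's I using simple binary neighbourhood weights (the `radius` is arbitrary), and the models are plain least-squares fits.

```python
import numpy as np

def morans_i(values, coords, radius=0.15):
    """Moran's I with binary weights: w_ij = 1 if points i and j are closer than `radius`."""
    z = values - values.mean()
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = ((dist < radius) & (dist > 0)).astype(float)  # no self-neighbours
    return len(z) / w.sum() * (z @ w @ z) / (z @ z)

def ols_residuals(y, *predictors):
    """Residuals of a least-squares fit of y on an intercept plus the given predictors."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

rng = np.random.default_rng(42)
n = 400
coords = rng.uniform(0, 1, size=(n, 2))

# Hypothetical predictors that vary smoothly in space, so they are autocorrelated.
climate = np.sin(2 * np.pi * coords[:, 0])
soil = np.cos(2 * np.pi * coords[:, 1])
noise = rng.normal(0, 0.3, n)  # spatially independent noise

# Case 1: the response inherits its SA from the predictors.
y = 2.0 * climate + 1.5 * soil + noise
i_raw = morans_i(y, coords)                                  # SA in the raw data
i_full = morans_i(ols_residuals(y, climate, soil), coords)   # RSA, both predictors included
i_omitted = morans_i(ols_residuals(y, climate), coords)      # RSA, soil left out

# Case 2: all relevant predictors are included, but the functional form is wrong.
y_nonlinear = 2.0 * climate**2 + noise
i_wrong_form = morans_i(ols_residuals(y_nonlinear, climate), coords)

print(f"raw data:              I = {i_raw:.2f}")
print(f"residuals, full model: I = {i_full:.2f}")
print(f"residuals, soil left out: I = {i_omitted:.2f}")
print(f"residuals, wrong form: I = {i_wrong_form:.2f}")
```

In this setup, the raw response is strongly autocorrelated, but the residuals of the model with both predictors should be close to zero, which is the harmless situation. RSA only shows up when a spatially structured predictor is left out or when the functional form is wrong.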
My conclusion on this exchange:
- The ecological literature has been far too sloppy when distinguishing between SA in the data and RSA. In fact, most studies, when referring to RSA, simply use the word SA. Consistently saying RSA rather than SA when RSA is meant will avoid confusion.
- RSA is by definition a violation of statistical assumptions. This does not mean that inferential results MUST be biased, but they CAN be. Therefore, something MUST be done if strong RSA is detected.
- About WHAT must be done, the statistical literature may have been a bit fast in promoting phenomenological add-ons to regression models that basically “absorb” RSA. Maybe we should more often try to invest some time in understanding the ecological reason for RSA and remove it by modifying the model, e.g. through including additional predictor variables or choosing a different functional form. Reinterpreted like that, there is some merit in the critique of the current treatment of RSA. However, RSA that cannot be removed by improving the models still must be corrected, because of point (2) in the quote from Kühn and Dormann above.
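Why strong RSA cannot just be ignored can also be seen in a simulation. Again, this is a minimal sketch under my own assumptions and not an analysis from either article: "space" is reduced to a one-dimensional transect, autocorrelation is generated with a simple AR(1) process (the helper `ar1` and the value 0.9 are arbitrary choices), and significance is judged with the usual regression t-test, approximated by a critical value of 1.96. The response is unrelated to the predictor by construction, so every "significant" slope is a false positive.

```python
import numpy as np

def ar1(n, rho, rng):
    """Simulate a stationary AR(1) series, a stand-in for values along a transect."""
    e = rng.normal(size=n)
    x = np.empty(n)
    x[0] = e[0]
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * e[t]
    return x

def slope_is_significant(x, y, t_crit=1.96):
    """Test the regression slope with the usual standard error that assumes independent residuals."""
    n = len(x)
    xc = x - x.mean()
    slope = xc @ y / (xc @ xc)
    resid = (y - y.mean()) - slope * xc
    se = np.sqrt(resid @ resid / (n - 2) / (xc @ xc))
    return abs(slope / se) > t_crit

def false_positive_rate(rho_error, n_sims=500, n=100, seed=1):
    """Share of simulations in which an unrelated predictor comes out as significant."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sims):
        x = ar1(n, 0.9, rng)        # autocorrelated predictor
        y = ar1(n, rho_error, rng)  # response, generated independently of x
        hits += slope_is_significant(x, y)
    return hits / n_sims

rate_independent = false_positive_rate(0.0)     # residuals without autocorrelation
rate_autocorrelated = false_positive_rate(0.9)  # residuals with strong autocorrelation

print(f"false positives without RSA: {rate_independent:.2f}")
print(f"false positives with RSA:    {rate_autocorrelated:.2f}")
```

Without residual autocorrelation, the false positive rate should stay near the nominal 5%. With strongly autocorrelated residuals and an autocorrelated predictor, it should be far higher, which is what "incorrect error probabilities" means in practice.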
[1] Pretty much everything said here also applies to temporal autocorrelation if “space” is replaced by “time”.