Bayesian, frequentist and statistical learning perspectives on penalising model complexity

In regression analysis, a common problem is to decide on the right functional form of the fitted model. On the one hand, we would like to make the model as flexible as possible so that it can adapt without bias to the true data-generating process. On the other hand, the more freedom we give the model, the more uncertain the parameter estimates become (i.e. the higher the variance), which leads to total model error increasing with complexity beyond a certain point. This phenomenon, known as the bias-variance trade-off, leads to the insight that we should limit or penalise model complexity to get reasonable inferences.

As a result of this insight, a large number of statistical approaches exist where we try to optimise an objective of the form:

Quality(M) = L(M) – complexityPenalty(M)

where M is the model, L(M) is the (log-)likelihood, and complexityPenalty(M) adds some penalty for the model’s complexity. Examples of this structure are information criteria such as the AIC / BIC, shrinkage estimators such as the lasso / ridge (L1 / L2) penalty, or the wiggliness penalty in GAMs.
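To make the template concrete, here is a sketch of how the examples just named fit it, reading L(M) as the log-likelihood (maximised, for the information criteria). The symbols are introduced here for illustration only: k is the number of parameters, n the number of observations, β the regression coefficients and λ the penalty strength. The scaling of each penalty follows one common convention; other texts and software scale them differently.

```latex
\begin{align*}
\text{AIC:}   \quad & \mathrm{Quality}(M) = \log \hat{L}(M) - k
    && \text{since } \mathrm{AIC} = 2k - 2\log \hat{L}(M) = -2\,\mathrm{Quality}(M) \\
\text{BIC:}   \quad & \mathrm{Quality}(M) = \log \hat{L}(M) - \tfrac{k}{2}\log n
    && \text{since } \mathrm{BIC} = k\log n - 2\log \hat{L}(M) \\
\text{ridge:} \quad & \mathrm{Quality}(\beta) = \log L(\beta) - \lambda \sum_j \beta_j^2 \\
\text{lasso:} \quad & \mathrm{Quality}(\beta) = \log L(\beta) - \lambda \sum_j |\beta_j|
\end{align*}
```

For AIC and BIC the penalty depends only on the number of parameters, whereas for ridge and lasso it depends on the size of the coefficients, with λ controlling how strongly complexity is punished.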

When these techniques are introduced in stats classes, they are usually motivated as a means to reduce overfitting, based on the arguments that I gave above. It is also known, though possibly less widely, that many of these penalties can be reinterpreted as Bayesian priors. For example, shrinkage penalties such as the lasso (L1) and the ridge (L2) are equivalent to a double exponential (Laplace) and a normal prior on the regression parameters, respectively, in the sense that the penalised estimate coincides with the posterior mode (see Fig. 1). Likewise, wiggliness penalties in GAMs can be reinterpreted as priors on functional simplicity (see Miller, David L. (2019)).

Fig. 1: Equivalence of common regularisation penalties with Bayesian priors on the respective parameters. From Polson, N. G., & Sokolov, V. (2019), see also Park, T., & Casella, G. (2008).
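The equivalence can be checked numerically. The following is a minimal sketch, not a general implementation: it assumes a regression with a single slope, no intercept and a known noise standard deviation, uses simulated toy data, and all function names are made up for this illustration. It compares the closed-form ridge and lasso estimates with the posterior mode under a normal and a double exponential prior, the latter found by brute force on a grid.

```python
import math
import random

def ridge_slope(x, y, lam):
    """Minimiser of 0.5 * RSS(b) + 0.5 * lam * b**2 for a single slope b."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + lam)

def lasso_slope(x, y, lam):
    """Minimiser of 0.5 * RSS(b) + lam * abs(b): soft-thresholding of sum(x*y)."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return math.copysign(max(abs(sxy) - lam, 0.0), sxy) / sxx

def posterior_mode(x, y, sigma, log_prior, lo=-2.0, hi=2.0, steps=8001):
    """Brute-force posterior mode of the slope on an evenly spaced grid."""
    def neg_log_post(b):
        rss = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
        return rss / (2 * sigma ** 2) - log_prior(b)
    grid = (lo + (hi - lo) * i / (steps - 1) for i in range(steps))
    return min(grid, key=neg_log_post)

# Simulated toy data (hypothetical): y = 0.5 * x + noise, no intercept.
random.seed(1)
sigma = 1.0   # noise standard deviation, assumed known
x = [random.gauss(0, 1) for _ in range(30)]
y = [0.5 * xi + random.gauss(0, sigma) for xi in x]

tau = 0.5     # standard deviation of the normal prior
scale = 0.5   # scale of the double exponential (Laplace) prior

map_normal = posterior_mode(x, y, sigma, lambda b: -b ** 2 / (2 * tau ** 2))
map_laplace = posterior_mode(x, y, sigma, lambda b: -abs(b) / scale)

# Matching penalties: lam = sigma^2 / tau^2 (ridge), lam = sigma^2 / scale (lasso).
ridge = ridge_slope(x, y, sigma ** 2 / tau ** 2)
lasso = lasso_slope(x, y, sigma ** 2 / scale)

print(f"ridge {ridge:.3f} vs normal-prior mode {map_normal:.3f}")
print(f"lasso {lasso:.3f} vs Laplace-prior mode {map_laplace:.3f}")
```

Under this scaling of the objective, each pair should agree up to the resolution of the grid. Note that a narrower prior (smaller tau or scale) translates into a larger penalty.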

One may therefore be tempted to re-interpret complexity penalties from statistical learning such as L1/L2 as an a-priori preference for simplicity, similar to Occam’s razor. This, however, misses an important point: in statistical learning, the strength of the penalty is usually estimated from data. L1/L2 complexity penalties, for example, are usually optimised via cross-validation. Thus, the simplicity preference in these statistical learning methods is not really a priori (as you would expect if we had a fundamental / scientific, data-independent preference for simplicity), but something that is adjusted from the data to optimise the bias-variance trade-off. Note also that, in low-data situations, the penalty may easily favour models that are far simpler than the truth.
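To illustrate what "estimated from data" means here, the sketch below picks a ridge penalty by 5-fold cross-validation. It is a toy illustration under stated assumptions, not how any particular package does it: the data are simulated (two real effects, eight noise predictors, no intercept), the ridge fit uses a simple coordinate descent written for this example, and the penalty grid and all names are arbitrary choices.

```python
import random

def fit_ridge(X, y, lam, sweeps=100):
    """Minimise 0.5 * RSS + 0.5 * lam * sum(b_j**2) by cyclic coordinate descent."""
    cols = list(zip(*X))                        # predictor columns
    sxx = [sum(c * c for c in col) for col in cols]
    b = [0.0] * len(cols)
    resid = list(y)                             # residuals y - X b for b = 0
    for _ in range(sweeps):
        for j, col in enumerate(cols):
            # Optimal b_j given the current values of all other coefficients.
            rho = sum(c * (r + c * b[j]) for c, r in zip(col, resid))
            new = rho / (sxx[j] + lam)
            resid = [r - c * (new - b[j]) for c, r in zip(col, resid)]
            b[j] = new
    return b

def cv_error(X, y, lam, k=5):
    """Mean squared prediction error on held-out folds for a given penalty."""
    n = len(y)
    sq_err = 0.0
    for fold in range(k):
        held_out = set(range(fold, n, k))       # every k-th observation
        train = [i for i in range(n) if i not in held_out]
        b = fit_ridge([X[i] for i in train], [y[i] for i in train], lam)
        for i in held_out:
            pred = sum(bj * xij for bj, xij in zip(b, X[i]))
            sq_err += (y[i] - pred) ** 2
    return sq_err / n

# Simulated toy data (hypothetical): 40 observations, 10 predictors, 2 real effects.
random.seed(42)
n, p = 40, 10
true_b = [1.0, -0.5] + [0.0] * (p - 2)
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [sum(bj * xij for bj, xij in zip(true_b, row)) + random.gauss(0, 1) for row in X]

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
cv_errors = {lam: cv_error(X, y, lam) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)

print("CV error per penalty:", {lam: round(e, 3) for lam, e in cv_errors.items()})
print("penalty chosen by cross-validation:", best_lam)
```

The chosen penalty depends on the data set: with a different sample size or noise level, cross-validation would generally settle on a different value, which is exactly why the resulting "prior" is not data-independent.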

This is the reason why classical L1/L2 regularisations are better interpreted as “empirical Bayesian” rather than fully Bayesian. Empirical Bayesian methods are methods that use the Bayesian framework for inference, but with priors that are estimated from data. Empirical and fully Bayesian perspectives can be switched or mixed though. One could, for example, add additional data-independent priors on simplicity in a model, and in some sense the common Bayesian practice of using “weakly informative” (data-independent) priors on regression parameters could be interpreted as a light fundamental preference of Bayesians for simplicity.

How does that help us in practice? Well, for example, I am a big fan of shrinkage estimators and would nearly always prefer them over variable selection. The reason why they are rarely used in ecology, however, is that frequentist regression packages that use shrinkage (such as glmnet) don’t calculate p-values. The reason is that obtaining calibrated p-values or CIs with nominal coverage for shrinkage estimators is hard, showing that the latter are probably better understood as a statistical learning method that optimises predictive error than as a frequentist method that has controlled error rates. If we re-interpret the shrinkage estimator as a prior in a Bayesian analysis, however, we naturally get ordinary posterior estimates that can be interpreted straightforwardly for inference. Thus, if you want to apply L1 / L2 penalties in a regression without losing the ability to discuss the statistical evidence for an effect, just do it Bayesian!
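As a rough illustration of what the Bayesian reading adds, consider the simplest conjugate case: a single slope, no intercept, a known noise standard deviation and a fixed normal prior. All of these are simplifying assumptions made for this sketch, and the data and names are hypothetical. Here the posterior is itself a normal distribution whose mean equals the ridge estimate, so a credible interval and the posterior probability of a positive effect come for free. Real analyses with unknown noise variance or L1-type priors would typically use MCMC instead.

```python
import random
from statistics import NormalDist

def slope_posterior(x, y, sigma, tau):
    """Posterior of a single slope b under y ~ N(b * x, sigma^2), b ~ N(0, tau^2)."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    lam = sigma ** 2 / tau ** 2          # equivalent ridge penalty
    mean = sxy / (sxx + lam)             # identical to the ridge point estimate
    sd = sigma / (sxx + lam) ** 0.5      # posterior standard deviation
    return NormalDist(mean, sd)

# Simulated toy data (hypothetical): y = 0.5 * x + noise, no intercept.
random.seed(7)
sigma, tau = 1.0, 0.5                    # known noise sd and prior sd (assumed)
x = [random.gauss(0, 1) for _ in range(30)]
y = [0.5 * xi + random.gauss(0, sigma) for xi in x]

post = slope_posterior(x, y, sigma, tau)
ci_95 = (post.inv_cdf(0.025), post.inv_cdf(0.975))   # 95% credible interval
p_positive = 1.0 - post.cdf(0.0)                     # posterior P(slope > 0)

print(f"posterior mean {post.mean:.3f}, sd {post.stdev:.3f}")
print(f"95% credible interval ({ci_95[0]:.3f}, {ci_95[1]:.3f})")
print(f"posterior probability of a positive slope: {p_positive:.3f}")
```

The point estimate is shrunk exactly as in ridge regression, but it now comes with an uncertainty statement that can be discussed as evidence for or against an effect.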

References

Miller, D. L. (2019). Bayesian views of generalized additive modelling. arXiv preprint arXiv:1902.01330.

Polson, N. G., & Sokolov, V. (2019). Bayesian regularization: From Tikhonov to horseshoe. Wiley Interdisciplinary Reviews: Computational Statistics, 11(4), e1463.

Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681-686.

3 thoughts on “Bayesian, frequentist and statistical learning perspectives on penalising model complexity”

  1. Hi Florian,

    Thanks for the post!
    I am quite new to shrinkage modeling – most of the studies I’ve done so far in ecology are either based on model selection or have some Bayesian elements – even though very basic, just to allow a little more flexibility over the frequentist approach.

    I’ve been collaborating with statisticians who recommend and use methods with shrinkage (Lasso, Ridge), but they are still kind of “mysterious” in practice. Do you have any beginner’s reading suggestion on that?

    Once these basics are understood, having a Bayesian version of that seems quite interesting!


    • Hi Bernado,

      as shown in the picture, a Ridge regression is mathematically the same as a Bayesian regression with a normal prior on the regression slopes. The width of the prior then determines the shrinkage penalty: the smaller the prior width, the larger the penalty.

      The only difference from the Bayesian perspective is how you set the width of the prior (= shrinkage penalty). I would say that there are 2-3 solutions in practice:

      1) set a light shrinkage penalty a priori (this is known as weakly informative priors). If you do this, most people don’t even use the word shrinkage, but effectively it means that you have to be less worried about overfitting / parameter selection in a GLMM setting

      2) set a stronger shrinkage prior, and get the value from something else (e.g. cross-validation). This is rarely done in my experience

      3) set adaptive shrinkage priors, where you make the shrinkage another parameter that is estimated, and set a common prior for the parameters. See for example https://mc-stan.org/rstanarm/reference/priors.html#hierarchical-shrinkage-family

      I don’t really have a good review at hand. I would suggest googling “Bayesian shrinkage prior Stan” or similar to get examples with code.
