Last week, I gave a seminar about MCMC chain analysis and convergence diagnostics with coda in R, and I thought a summary would make a nice post.

As a prerequisite, we will use a few lines of code, very similar to a previous post on MCMC sampling. In the code, we create some test data in which a dependent variable y depends linearly on an independent variable x (predictor); define likelihood and priors for a linear model to be fit to the data; and implement a simple Metropolis-Hastings MCMC to sample from the posterior distribution of this model. It’s actually not important for the things that follow that you understand in detail what’s going on here, but for those who do not feel at ease about this, have a look at my previous post.

trueA = 5 trueB = 0 trueSd = 10 sampleSize = 31 x = (-(sampleSize-1)/2):((sampleSize-1)/2) y = trueA * x + trueB + rnorm(n=sampleSize,mean=0,sd=trueSd) likelihood = function(param){ a = param[1] b = param[2] sd = param[3] pred = a*x + b singlelikelihoods = dnorm(y, mean = pred, sd = sd, log = T) sumll = sum(singlelikelihoods) return(sumll) } prior = function(param){ a = param[1] b = param[2] sd = param[3] aprior = dunif(a, min=0, max=10, log = T) bprior = dnorm(b, sd = 5, log = T) sdprior = dunif(sd, min=0, max=30, log = T) return(aprior+bprior+sdprior) } proposalfunction = function(param){ return(rnorm(3,mean = param, sd= c(0.1,0.5,0.3))) } run_metropolis_MCMC = function(startvalue, iterations){ chain = array(dim = c(iterations+1,3)) chain[1,] = startvalue for (i in 1:iterations){ proposal = proposalfunction(chain[i,]) probab = exp(likelihood(proposal)+ prior(proposal) - likelihood(chain[i,])- prior(chain[i,])) if (runif(1) &lt; probab){ chain[i+1,] = proposal }else{ chain[i+1,] = chain[i,] } } return(mcmc(chain)) }

So, let’s run the MCMC:

startvalue = c(4,2,8) chain = run_metropolis_MCMC(startvalue, 10000)

### Some simple summaries of the chain facilitated by coda

The run_metropolis_MCMC() function returns the posterior sample as a coda object of class “mcmc” (this is done in the line “return(mcmc(chain))”).

What’s the sense of this line? Well, coda is an R package that provides a number of standard functions for plotting and analysis of the posterior samples. For those functions to work, you need to have your output as an object of class “mcmc”, or “mcmc.list”, which we will discuss later. Coda is the standard package for this type of analysis, and most Bayesian packages in R use this class to return MCMC outputs, so you will likely come across this syntax whatever Bayesian code you are running.

Objects of class “mcmc” hold and array with the mcmc samples, and a number of additional information. You can look at the structure with str(chain), and you can transform a “mcmc” object back to a normal data-frame by data.frame(chain).

The advantage of having a coda object is that a lot of things that we typically want to do with the chain are already implemented, so for example we can simply summary() and plot() the outputs

summary(chain) plot(chain)

which gives some useful information on the console and a plot that should look roughly like this:

**Figure:** Results of the plot() function of a coda object

I think the summary is self-explanatory (otherwise check out the help), but maybe a few words on the results of the plot() function: each row corresponds to one parameter, so there a are two plots for each parameter. The left plot is called a trace plot – it shows the values the parameter took during the runtime of the chain. The right plot is usually called a marginal density plot. Basically, it is the (smoothened) histogram of the values in the trace-plot, i.e. the distribution of the values of the parameter in the chain.

### Marginal densities hide correlations

Marginal densities are an average over the values a parameter takes with all other parameters “marginalized”, i.e. other parameters having any values according to their posterior probabilities. Often, marginal densities are treated as the main output of a Bayesian analysis (e.g. by reporting their mean and sd of), but I strongly advice against this practice without further analysis. The reason is that marginal densities “hide” correlations between parameters, and if there are correlations, parameter uncertainties appear to be much greater in the marginals that they actually are. To check for pairwise correlations is quite easy – just use pairs on the MCMC chain:

pairs(data.frame(chain))

or use this code snippet which produces the “nicer” pair correlation plot that you can see below.

Addition: here’s a post with further options for creating correlation plots.

In our case, there should be no large correlations because I set up the example in that way, but we can easily achieve a correlation between slope and intercept by “uncentering” our data, that is, having x-values that are not centered around 0. To see this, replace in the first large code fragment the creation of the test data by this line which creates non-centered x-values

x = (-(sampleSize-1)/2):((sampleSize-1)/2) + 20

Running the MCMC again and checking for correlations should give you a quite different picture.

**Figure:** Marginal densities (diagonal), pairwise densities (lower panes) and correlation coefficien (upper panels) for the fit with unbalanced x-values

You can see the strong correlation between the first and the second parameter (slope and intercept), and you can also see that your marginal uncertainty for ~~each~~ parameter X2 has increased (addition: see comments below, surprisingly the slope parameters shrinks). In any case, what is really important to understand is that the fit / predictive ability of the model has not fundamentally changed – the Bayesian analysis doesn’t care if you shift the x-values by 20 to the right. However, marginally, the distributions look different now, and you see different marginal uncertainties. You will only arrive at the correct (unchanged) conclusions if you take the correlations into account. For example, it doesn’t make sense any more to say that the slope has a value of x +/- sd because this misses that point that for any given parameter of the intercept, the uncertainty of the slope is much smaller. For that reason, one should always check the correlations, and if possible, one should try to avoid correlations between parameters because this makes the analysis easier.

Note that we only checked for pairwise correlations here, there may still be higher order interactions that don’t show up in an analysis like that, so you may still be missing something. For that reason, the advice is to summarize the chain only when really necessary, otherwise things like the prior predictive distribution etc. should always be created by sampling directly from the chain.

### Convergence diagnostics

Now, to the convergence: an MCMC creates a sample from the posterior distribution, and we usually want to know whether this sample is sufficiently close to the posterior to be used for analysis. There are several standard ways to check this, but I recommend the Gelman-Rubin diagnostic (check the coda help for other options that are implemented). Basically, Gelman-Rubin measures whether there is a significant difference between the variance within several chains and the variance between several chains by a value that is called “scale reduction factors”. To do this, we obviously need a second chain, and then simply run the commands:

chain2 = run_metropolis_MCMC(startvalue, 10000) combinedchains = mcmc.list(chain, chain2) plot(combinedchains) gelman.diag(combinedchains) gelman.plot(combinedchains)

The plot should look something like this

**Figure:** Results of the gelman.plot() function

The gelman.diag gives you the scale reduction factors for each parameter. A factor of 1 means that between variance and within chain variance are equal, larger values mean that there is still a notable difference between chains. Often, it is said that everything below 1.1 or so is OK, but note that this is more a rule of thumb. The gelman,plot shows you the development of the scale-reduction over time (chain steps), which is useful to see whether a low chain reduction is also stable (sometimes, the factors go down and then up again, as you will see). Also, note that for any real analysis, you have to make sure to discard any bias that arises from the starting point of your chain (burn-in), typical values here are a few 1000-10000 steps. The gelman plot is also a nice tool to see roughly where this point is, that is, from which point on the chains seem roughly converged.

The **complete code for this example** is available here.

### Improving convergence / mixing

So, what to do if there is no convergence yet? Of course, you can always run the MCMC longer, but the other option is to make it converge faster … the word that is used here is “mixing”, which basically means how well the algorithm jumps around in the parameter space … the mixing is affected by the choice of your proposal function. Two things can happen:

- Your proposal function is narrow compared to the distribution we sample from – high acceptance rate, but we don’t get anywhere, bad mixing
- Your proposal function is too wide compared to the distribution we sample from – low acceptance rate, most of the time we stay where we are

**Figure**: problems in the proposal function and how they show up in trace-plots

As displayed in the figure, these problems can be seen in the trace plots. Theoretical considerations show that an acceptance rate of 20-30% is optimal for typical target distributions, but this is in practice not so helpful because you can still have very bad mixing although being at this level by having the proposal of one parameter too narrow and the proposal of another parameter too wide. Better to look at the trace-plots of the individual parameters. Again, correlations are a problem, so if you have strong correlations in parameter space, you can get bad mixing. Using multivariate proposals that are adjusted to the correlation structure can help, but in general it is better to avoid correlations if possible. The good news at last: most of the more “professional” MCMC sampling software such as Jags or WinBugs will do these things automatically for you.

Pingback: A simple Metropolis-Hastings MCMC in R « theoretical ecology

Pingback: Monday, April 30 | Julie I. Santos

Pingback: A simple Approximate Bayesian Computation MCMC (ABC-MCMC) in R « theoretical ecology

Reblogged this on My Notepad.

Pingback: Explaining the ABC-Rejection Algorithm in R | theoretical ecology

Pingback: Explaining the ABC-Rejection Algorithm in R ← Patient 2 Earn

Thanks a lot for this excellent post. very helpful.

Pingback: Making the Most of MCMC | Sweet Tea, Science

Complete code of the example is missing. The given link is broken.

Hi Florian,

You haven’t included the burnIn in the code. burnIn is not required while using coda or you just ignored it?

Thank you,

MS

Thanks for broken link, I fixed this. In coda and most other packages, burn-in is not done automatically, so if you don’t specify it the whole chain will be used. It depends on what you want to do with the chain if you discard burn-in (for sure, you want to discard if you want to estimate posterior summaries). In this example, I didn’t discard it so that you can see the full trace / Gelman plots.

Hi Florian,

I have a question about the section of Marginal densities hide correlations.

Typically, people talk about the correlation of input variables, not the parameters. I am puzzled that how do we know the parameters are correlated and what’s the high correlation of parameters supposed to mean.

By running the code, i got a narrower bandwidth for parameter 1 and a wider bandwidth for parameter 2. This is counter-intuitive by reading your post because we are supposed to get a wider bandwidth for both parameters, correct? Can you please shed more lights on it?

Thank you very much.

Lizy

Hi Lizy,

when I say that there is a correlation between parameters, what I mean is that we can maintain a good fit / likelihood / posterior by increasing parameter X1 when we decrease parameter X2 at the same time. You could also call this a trade-off between the two parameters.

Such trade-offs can appear through a number of reasons. ONE important reason is correlations between the predictors (collinearity). Other possible reasons are a) the model structure (as in the example with the intercept), or b) simply that the problem is underdetermined (not enough data to constrain the parameters).

You may be right that people talk more about collinearity (correlation in predictor space), but identifying trade-offs in parameter space is crucial for a correct interpretation of the uncertainty and for correct forecasting. This is independent of whether you are working Bayesian or not. Also lme4, for example, returns the covariance matrix of the likelihood density (i.e. the correlation of the parameters) as a standard output with their summary() function.

About the width – I just reran it and yes, the marginal uncertainty for the slope shrinks. I have to think a bit more about why this happens, I find it somewhat counterintuitive. In any case, what we should keep in mind is that the true uncertainty of the combination of slope / intercept is smaller than in appears marginally, because particular combinations of the two are not compatible with the data.

Hi Florian,

first of all thanks for the great articles on McMC sampling in R. They have been of great help so far also for an R-newbie like me.

I would like to ask you in it’s possible to use CODA or other packages in order to run parallel independent chains using a multicore system and then combine the chains. Thanks a lot.

If you want to use the code from this post, just use the standard parallel functions of R to parallelize a for loop / apply for the chains, collect the results, and put the chains together.

For the more serious MCMC packages for Bayesian analysis (e.g. Jags, STAN) there are options in the R interface to do this.