While there are certainly good software packages out there to do the job for you, notably BUGS or JAGS, it is instructive to program a simple MCMC yourself. In this post, I give an educational example of the Bayesian equivalent of a linear regression, sampled by an MCMC with Metropolis-Hastings steps, based on an earlier version which I did to together with Tamara Münkemüller. Since first publishing this post, I have made a few small modifications to improve clarity. A similar post on Metropolis-Hastings MCMC algorithms by Darren Wilkinson is also worth looking at. More on analyzing the results of this algorithm can be found in a recent post.

**Creating test data**

As a first step, we create some test data that will be used to fit our model. Let’s assume a linear relationship between the predictor and the response variable, so we take a linear model and add some noise.

trueA <- 5 trueB <- 0 trueSd <- 10 sampleSize <- 31 # create independent x-values x <- (-(sampleSize-1)/2):((sampleSize-1)/2) # create dependent values according to ax + b + N(0,sd) y <- trueA * x + trueB + rnorm(n=sampleSize,mean=0,sd=trueSd) plot(x,y, main="Test Data")

I balanced x values around zero to “de-correlate” slope and intercept. The result should look something like the figure to the right

**Defining the statistical model**

The next step is to specify the statistical model. We already know that the data was created with a linear relationship y = a*x + b between x and y and a normal error model N(0,sd) with standard deviation sd, so let’s use the same model for the fit and see if we can retrieve our original parameter values.

**Derive the likelihood function from the model**

For estimating parameters in a Bayesian analysis, we need to derive the likelihood function for the model that we want to fit. The likelihood is the probability (density) with which we would expect the observed data to occur conditional on the parameters of the model that we look at. So, given that our linear model y = b + a*x + N(0,sd) takes the parameters (a, b, sd) as an input, we have to return the probability of obtaining the test data above under this model (this sounds more complicated as it is, as you see in the code, we simply calculate the difference between predictions y = b + a*x and the observed y, and then we have to look up the probability densities (using dnorm) for such deviations to occur.

likelihood <- function(param){ a = param[1] b = param[2] sd = param[3] pred = a*x + b singlelikelihoods = dnorm(y, mean = pred, sd = sd, log = T) sumll = sum(singlelikelihoods) return(sumll) } # Example: plot the likelihood profile of the slope a slopevalues <- function(x){return(likelihood(c(x, trueB, trueSd)))} slopelikelihoods <- lapply(seq(3, 7, by=.05), slopevalues ) plot (seq(3, 7, by=.05), slopelikelihoods , type="l", xlab = "values of slope parameter a", ylab = "Log likelihood")

As an illustration, the last lines of the code plot the Likelihood for a range of parameter values of the slope parameter a. The result should look something like the plot to the right.

**Why we work with logarithms**

You might have noticed that I return the logarithm of the probabilities in the likelihood function, which is also the reason why I sum the probabilities of all our datapoints (the logarithm of a product equals the sum of the logarithms). Why do we do this? You don’t have to, but it’s strongly advisable because likelihoods, where a lot of small probabilities are multiplied, can get ridiculously small pretty fast (something like 10^-34). At some stage, computer programs are getting into numerical rounding or underflow problems then. So, bottom-line: when you program something with likelihoods, always use logarithms!!!

**Defining the prior**

As a second step, as always in Bayesian statistics, we have to specify a prior distribution for each parameter. To make it easy, I used uniform distributions and normal distributions for all three parameters. *[Some additional information for the “professionals”, skip this when you don’t understand what I’m talking about: while this choice can be considered pretty “uninformative” for the slope and intercept parameters, it is not really uninformative for the standard deviations. An uninformative prior for the latter would usually be scale with 1/sigma (if you want to understand the reason, see here). This stuff is important when you seriously dive into Bayesian statistics, but I didn’t want to make the code more confusing here.]*

# Prior distribution prior <- function(param){ a = param[1] b = param[2] sd = param[3] aprior = dunif(a, min=0, max=10, log = T) bprior = dnorm(b, sd = 5, log = T) sdprior = dunif(sd, min=0, max=30, log = T) return(aprior+bprior+sdprior) }

**The posterior**

The product of prior and likelihood is the actual quantity the MCMC will be working on. This function is called the posterior (or to be exact, it’s called the posterior after it’s normalized, which the MCMC will do for us, but let’s not be picky for the moment). Again, here we work with the sum because we work with logarithms.

posterior <- function(param){ return (likelihood(param) + prior(param)) }

**Figure:** A visualization of the Metropolis-Hastings MCMC, from Hartig et al., 2011. (copyright see publisher)

**The MCMC**

Now, here comes the actual Metropolis-Hastings algorithm. One of the most frequent applications of this algorithm (as in this example) is sampling from the posterior density in Bayesian statistics. In principle, however, the algorithm may be used to sample from any integrable function. So, the aim of this algorithm is to jump around in parameter space, but in a way that the probability to be at a point is proportional to the function we sample from (this is usually called the target function). In our case this is the posterior defined above.

This is achieved by

- Starting at a random parameter value
- Choosing a new parameter value close to the old value based on some probability density that is called the proposal function
- Jumping to this new point with a probability p(new)/p(old), where p is the target function, and p>1 means jumping as well

It’s fun to think about why that works, but for the moment I can assure you it does – when we run this algorithm, distribution of the parameters it visits converges to the target distribution p. So, let’s get this in R:

######## Metropolis algorithm ################ proposalfunction <- function(param){ return(rnorm(3,mean = param, sd= c(0.1,0.5,0.3))) } run_metropolis_MCMC <- function(startvalue, iterations){ chain = array(dim = c(iterations+1,3)) chain[1,] = startvalue for (i in 1:iterations){ proposal = proposalfunction(chain[i,]) probab = exp(posterior(proposal) - posterior(chain[i,])) if (runif(1) < probab){ chain[i+1,] = proposal }else{ chain[i+1,] = chain[i,] } } return(chain) } startvalue = c(4,0,10) chain = run_metropolis_MCMC(startvalue, 10000) burnIn = 5000 acceptance = 1-mean(duplicated(chain[-(1:burnIn),]))

Again, working with the logarithms of the posterior might be a bit confusing at first, in particular when you look at the line where the acceptance probability is calculated (probab = exp(posterior(proposal) – posterior(chain[i,]))). To understand why we do this, note that p1/p2 = exp[log(p1)-log(p2)].

The first steps of the algorithm may be biased by the initial value, and are therefore usually discarded for the further analysis (burn-in time). An interesting output to look at is the acceptance rate: how often was a proposal rejected by the metropolis-hastings acceptance criterion? The acceptance rate can be influenced by the proposal function: generally, the closer the proposals are, the larger the acceptance rate. Very high acceptance rates, however, are usually not beneficial: this means that the algorithms is “staying” at the same point, which results in a suboptimal probing of the parameter space (mixing). It can be shown that acceptance rates between 20% and 30% are optimal for typical applications (more on that here).

Finally, we can plot the results. There are more elegant ways of plotting this which I discuss in another recent post, so check this out, but for the moment I didn’t want to use any packages, so we do it the hard way:

### Summary: ####################### par(mfrow = c(2,3)) hist(chain[-(1:burnIn),1],nclass=30, , main="Posterior of a", xlab="True value = red line" ) abline(v = mean(chain[-(1:burnIn),1])) abline(v = trueA, col="red" ) hist(chain[-(1:burnIn),2],nclass=30, main="Posterior of b", xlab="True value = red line") abline(v = mean(chain[-(1:burnIn),2])) abline(v = trueB, col="red" ) hist(chain[-(1:burnIn),3],nclass=30, main="Posterior of sd", xlab="True value = red line") abline(v = mean(chain[-(1:burnIn),3]) ) abline(v = trueSd, col="red" ) plot(chain[-(1:burnIn),1], type = "l", xlab="True value = red line" , main = "Chain values of a", ) abline(h = trueA, col="red" ) plot(chain[-(1:burnIn),2], type = "l", xlab="True value = red line" , main = "Chain values of b", ) abline(h = trueB, col="red" ) plot(chain[-(1:burnIn),3], type = "l", xlab="True value = red line" , main = "Chain values of sd", ) abline(h = trueSd, col="red" ) # for comparison: summary(lm(y~x))

The resulting plots should look something like the plot below. You see that we retrieve more or less the original parameters that were used to create our data, and you also see that we get a certain area around the highest posterior values that also have some support by the data, which is the Bayesian equivalent of confidence intervals.

**Figure:** The upper row shows posterior estimates for slope (a), intercept (b) and standard deviation of the error (sd). The lower row shows the Markov Chain of parameter values.

**References for further reading**Gelman, A.; Carlin, J. B.; Stern, H. S. & Rubin, D. B. (2003) Bayesian Data Analysis

Andrieu, C.; de Freitas, N.; Doucet, A. & Jordan, M. I. (2003) An introduction to MCMC for machine learning Mach. Learning, Springer, 50, 5-43

Pingback: Setting up OpenBugs and R « theoretical ecology

Pingback: MCMC chain analysis and convergence diagnostics with coda in R « theoretical ecology

Thank you very much for an excellent blog post about MCMC. I have noticed, however, a bit strange thing in my opinion: in the “prior” function you return a sum, not a vector, of obtained value (return(aprior+bprior+sdprior)). Should not that be a vector? otherwise how else would you call again this function when you do sampling?

Anna

PhD student in evolutionary biology

Hi Anna,

glad you like the post.

Regarding your question: in principle, I could also return a vector with the prior probability for each parameter separaetely, but later those would be summed up anyways.

The posterior (which is what we are working on with the MCMC) is given by

P(Parameters|Data) = L(Data|Parameters) * Prior(Parameters)

where Prior(Parameters) is the prior probability that a certain combination of parameters will occurr together. As we assume here that the prior probabilities for each parameter are independent, Prior(Parameters) = Prior(Parameter1) * Prior(Parameter1) * … , or if you work with the log of the probabilities, it’s the sum – this is why I return the sum here, it’s basically already the joint prior for a certain combination of parameters to occurr together based on the individual priors we assumed.

Cheers,

Florian

Very nice post, it really helped me to understand the inner workings of an MCMC algorithm. Nice Work!

Thanks, I’m glad you found it helpful. Got me motivated to go over it another time and try add a few additional pieces of explanation at some points.

Wonderful post, it helped me a lot. But I have a question. In the section “Defining the prior”, you said “To make it easy, I used uniform distributions for all three parameters, but not that for the standard deviation”. For paramter “b”, you define bprior = dnorm(b, sd = 5, log = T) in the prior function. I’m confused.

You’re completely right, I somehow hadn’t looked properly at the code when I wrote this text, thanks for spotting this, I corrected the text.

I also added a bit more explanation on the point about uniform priors for sigma not being uniformative, I think this was also confusing before, hope it’s better understandable now!

Thank you so much! This simple example showed and helped me understand how MCMC works.

Pingback: A simple Approximate Bayesian Computation MCMC (ABC-MCMC) in R « theoretical ecology

Reblogged this on My Notepad.

great post. It helped me a lot, but i have a question. How can we use mcmc when we want to estimate the parameters of a multiple regression with 4 variables for example?

Just modify the model in the likelihood function (where it says now pred = a*x + b) and modify the code to accommodate the additional parameters where appropriate.

However, this example was meant for teaching how an MCMC works, if you seriously want to fit more complicated models, use OpenBugs or Jags or such alike, e.g. https://theoreticalecology.wordpress.com/2011/07/20/setting-up-openbugs-and-r/ – this is much more efficient because of faster implementation and proposal adjustment.

This question also came up at http://stats.stackexchange.com/questions/147600/metropolis-hastings-mcmc-for-bayesian-regression-in-r/147691

Hi Florian,

Thank you very much for a very clear introduction on MCMC. However, after playing around with your code, I have noticed a problem:

The performance of the algorithm seems to get progressively worse as the sample size increases. E.g. sampleSize <- 301 and sampleSize <- 3001. The acceptance rate becomes extremely low in the latter case. As the log-likelihoods become large negative numbers, so will their differences (i.e., posterior(proposal) – posterior(chain[i,]) is a very large negative number for most proposed steps).

When sample size is large, it seems to be difficult to get the chains to mix well again by reducing the variances of the proposal distributions.

Am I missing something, or is this a common problem? Is there a way to get around this problem, or is it a matter of find tuning the proposal variances?

Best regards,

Wilson

Hi Wilson,

your observation is correct. The reason is that if you have more data, the posterior gets more narrow. As a result of that, the proposal function that I suggested is too wide and produces lots of rejections – adjust it and it will get better (professional samplers like JAGS do this automatically for you). I explain this in detail in a follow-up post https://theoreticalecology.wordpress.com/2011/12/09/mcmc-chain-analysis-and-convergence-diagnostics-with-coda-in-r/ . Also, note that with so much data, you initial guess might be much worse (in terms of relative goodness of fit) than before, so you might need to use a longer burn-in time. As a third comment, using that amount of data, it becomes very clear now that a flat prior is not uninformative, i.e. it carries a preference towards larger values for the standard deviation. If you want a more uninformative prior, a 1/sd prior scaling for the standard deviation would probably correct that and lead asymptotically to a better retrieval of the original parameters.

That makes so much sense now! Thanks again Florian. There is so much to learn in computational Bayesian, and it’s so interesting!

Glad you found it helpful. F

Pingback: Olhando o MCMC mais de perto… « Recologia

Hello Florian, the explanation is superb and very helpful!!! I have one question though, i was trying to use the same code to solve another(similar) simple regression question. I run into trouble especially in the proposal function. I suspect it is in the choice of the std deviations. In your example above, how did you choose the sd values sd= c(0.1,0.5,0.3), in the proposal function? Thank you in advance.

Hi Ngesa, first have a look at the last section of this post https://theoreticalecology.wordpress.com/2011/12/09/mcmc-chain-analysis-and-convergence-diagnostics-with-coda-in-r/ which explains the basic problem.

Basically, you want your proposal to fit to the shape of the posterior you sample from – very brief, one could say that for unimodal posterior distributions, a proposal with a similar shape as the posterior (including correlations between parameters), but a bit shrunk as compared to the posterior usually works well.

How much you should shrink depends on the number of parameters you vary in the proposal, the shape of the posterior and the how well your proposal mimics that shape – typically you will have to adjust the proposal after running your sampler for the first time (that is all right as long as you discard all samples during adjustment). As an initial guess (don’t quote me for that) I would start for a typical regression problem with 3-8 parameters and all parameters varied at once with 1/10th of the posterior uncertainty that I expect, but again, this depends a lot of how much parameters you have and how much correlation I would expect.

For more details see

Gelman, Andrew, et al. Bayesian data analysis. Chapman & Hall/CRC, 2004.

Andrieu, Christophe, and Johannes Thoms. “A tutorial on adaptive MCMC.” Statistics and Computing 18.4 (2008): 343-373.

I was wondering what if instead of log =T, I substitute log = F. How does that affect the over all results?

If you can please explain then I would greatly appreciate it.

Thank you

In principle it’s the same of course, it’s simply that computers have problems storing small numbers, as explained in the paragraph “why we work with likelihoods”

Try it out, if you create sufficiently large data, you will get rounding or underflow problems when working with Log = F (of course you have to update the formulas, e.g. instead of probab = exp(posterior(proposal) – posterior(chain[i,])) you sould have probab = posterior(proposal) / posterior(chain[i,]). We sometimes do this as an exercise in our courses.

Thank you, For your insight. The probab = exp(likelihood(proposal[i])+ prior(proposal[i]) – likelihood(chain[i,])- prior(chain[i,])). I understand that the likelihood function is used for determining the model performance. What is the purpose of using the prior.

We explain the motivation in http://ecologypapers.blog.com/files/2012/10/Hartig-Connectingdynamicvegetation-2012.pdf , in the section on “INFERRING PARAMETERS AND PROCESSES BY INVERSE MODELLING”.

this may be a basic question, but in the above you call a the intercept. isn’t it the slope of the regression line? forgive me if there is something obvious i am missing.

you’re right, I wrote intercept but plotted the likelihood for the slope, my bad, I’ve corrected the code. Thanks for the hint.

thanks. minor point but it tripped me up for a bit. thanks so much for this post. i am working through it and it has already been very helpful!

Pingback: MCMC for maximum likelihood in R | Trees 'r' Stats

Pingback: Running Bayesian models | Sam Clifford

Super helpful post !

I have been reading books, chapters, presentations, and papers about MCMC in the last 2 months. This post is just in time to clear up my cloudy neural network in my brain for MCMC.

Pingback: MCMC | Letian's Blog

Hi Florian

I run your script, but no show the web result. I tried any times, but not solve. I have set my seed?

I share the plot https://dl.dropboxusercontent.com/u/109431545/plot_hm.pdf

Thanks for you help me.

Paúl A Carrillo Maldonado

Looks roughly the same to me, except that you have a different size of your plot window, this may depend on your OS / editor / output device.

I would say everything is fine. Small differences of the results are expected of course, because, as you say, we haven’t fixed the random seed, so each time we are running this we have a slightly different dataset.

Pingback: A simple explanation of rejection sampling in R | theoretical ecology

Pingback: Making the Most of MCMC | Sweet Tea, Science

Thank you Hartig for you exceptional help to understand MCMC.

Thanks, I’m glad it was useful to you!

Thanks a lot for this very helpful post. I have just a doubt. Is it normal than the prior distributions are not updated through the iterations of the chain?

Thanks again

Fabio

The priors are not updated (per definition, the prior is fixed), but it influences the development of the chain through the posterior function (remember the posterior contains the prior as one part) – in that sense, the prior is considered in the updates.

Thanks for the fast replay.

Fabio

Amazing, great post, is it possible to re-calculate it by excel so it would make easy to understand for those who unfamiliar with R.

thaks

Haolia

Hi Haolia, glad you liked the post! About Excel – I guess it would be possible, but I have to admit I don’t see the point. Probably I would have to use Visual Basic to do this in Excel, and if you understand VBA, you should also be able to read R code. In general, if you want to do serious stats, the best thing to do is to stop using Excel for anything else than data entering, and get used to R.

Dear Florian, finally i can make the excel version.. very inspired.. thank you,

Haolia

Great tutorial on how to use MCMC to estimate a posterior distribution. Thanks a lot!

Great post. Really cleared things up for me. Question, however. If a non-symmetric proposal function was used a “correction factor” would be required, correct? I’m familiar with how to do it for a single parameter but, if required, how would it be implemented in the multi-parameter case you’ve described? Thanks.

Yes, if you have a non-symmetric proposal distribution g(x’|x) for going from x to x’, you will have to correct the Metropolis ratio with g(x’|x) / g(x|x’). This works for any number of dimensions.

In practice, this is rarely used though. Most situations where a non-symmetric proposal would be useful can better be dealt with by applying a transformation on the parameter space.

Hello, that’s great tutorial and sample. Thank you! But in the likelihood function, you are using the truth data x as the prediction. What should I use if x is unknown? I only have data y, and know that y is generated from an unkown x. Thanks.

Hi Jerome – in a regression, the assumption is that you know both the observed response (y) and the predictor (x), and what you want to estimate is the unknown effect of x on y.

The problem you have seems different, but I’m afraid you will have to be a bit more specific about what you want to achieve to get sensible help.

If you want to do something that is completely different from a regression, I would suggest that a forum such as stats.stackexchange.com/ is a more suitable place to get help than this post.

Very helpful post! I read several articles and book chapters on MCMC, but this post makes it more concrete for me. Can you help me with a question though? Why didn’t you update each parameter separately? Thank you!

Glad this was useful to you Guoguo.

About your question – sure, one can update one parameter at a time. This algorithm is called Metropolis-within-Gibbs. It is one of the many extensions of the standard Metropolis that aim at improving the speed of convergence for higher-dimensional more complex target distributions. For low-dimensional normal targets it will converge slightly slower though.

Thank you. Dr. Hartig!

Oh my GOD!! This is an extremely well executed post! Thank you very much to make it so clear! This is the exact example of how to be a teacher.

Hi,

thanks a lot for very helpful code. I just have one question. Which parts of the code except number of parameters and size of the matrix should be changed in order to make it suitable for estimating just one parameter?

Hi mabr, this should be easy enough – have you tried if you can manage yourself? If you need help, the post by Darren Wilkinson that I link to above is a 1-d example.

I’m not sure this is a Metropolis-Hastings implementation rather than Metropolis only. The former is for asymmetric distributions and includes the marginal probabilities in the accept/reject criteria ratio. In this case, because you are using symmetric distributions (normal and uniform) they are equivalent (I think anyway), but if you were to include asymmetric distributions (e.g. beta, gamma, lognormal) then this would not be a Metropolis-Hastings example.

Your line of code below does not include the marginal probabilities.

probab = exp(posterior(proposal) – posterior(chain[i,]))

See Hobbs and Hooten 2015 Ch. 7. Please correct me if I’m wrong.

I guess that’s true – I didn’t code the correction term because I used a symmetric proposal (in this case, the correction is always 1), but I guess it might be more pedagogical to add it, or change the name.

Note, however, that the correction works with the proposal distribution – you seem to refer to the prior distribution, the prior is accounted for here.

I didn’t get what you mean by the “Your line of code below does not include the marginal probabilities. …”