A simple Metropolis-Hastings MCMC in R

Florian Hartig / September 17, 2010

While there are certainly good software packages out there to do the job for you, notably BUGS or JAGS, but also our own BayesianTools package with general-purpose MCMC samplers, it is instructive to program a simple MCMC yourself. In this post, I give an educational example of the Bayesian equivalent of a linear regression, sampled by an MCMC with Metropolis-Hastings steps, based on an earlier version which I did to together with Tamara Münkemüller. Since first publishing this post, I have made a few small modifications to improve clarity. A similar post on Metropolis-Hastings MCMC algorithms by Darren Wilkinson is also worth looking at. The entire code can be found here, more on analyzing the results of this algorithm can be found in a recent post.

Creating test data

As a first step, we create some test data that will be used to fit our model. Let’s assume a linear relationship between the predictor and the response variable, so we take a linear model and add some noise.


trueA = 5
trueB = 0
trueSd = 10
sampleSize = 31

# create independent x-values
x =(-(sampleSize-1)/2):((sampleSize-1)/2)
# create dependent values according to ax + b + N(0,sd)
y = trueA * x + trueB + rnorm(n=sampleSize,mean=0,sd=trueSd)

plot(x,y, main="Test Data")

I balanced x values around zero to “de-correlate” slope and intercept. The result should look something like the figure to the right

Defining the statistical model

The next step is to specify the statistical model. We already know that the data was created with a linear relationship y = a*x + b between x and y and a normal error model N(0,sd) with standard deviation sd, so let’s use the same model for the fit and see if we can retrieve our original parameter values.

Derive the likelihood function from the model

For estimating parameters in a Bayesian analysis, we need to derive the likelihood function for the model that we want to fit. The likelihood is the probability (density) with which we would expect the observed data to occur conditional on the parameters of the model that we look at. So, given that our linear model y = b + a*x + N(0,sd) takes the parameters (a, b, sd) as an input, we have to return the probability of obtaining the test data above under this model (this sounds more complicated as it is, as you see in the code, we simply calculate the difference between predictions y = b + a*x and the observed y, and then we have to look up the probability densities (using dnorm) for such deviations to occur.


likelihood = function(param){
	a = param[1]
	b = param[2]
	sd = param[3]

	pred = a*x + b
	singlelikelihoods = dnorm(y, mean = pred, sd = sd, log = T)
	sumll = sum(singlelikelihoods)
	return(sumll)
}

# Example: plot the likelihood profile of the slope a
slopevalues = function(x){return(likelihood(c(x, trueB, trueSd)))}
slopelikelihoods = lapply(seq(3, 7, by=.05), slopevalues )
plot (seq(3, 7, by=.05), slopelikelihoods , type="l", xlab = "values of slope parameter a", ylab = "Log likelihood")

Log likelihood profile of the slope parameter

As an illustration, the last lines of the code plot the Likelihood for a range of parameter values of the slope parameter a. The result should look something like the plot to the right.

Why we work with logarithms

You might have noticed that I return the logarithm of the probabilities in the likelihood function, which is also the reason why I sum the probabilities of all our datapoints (the logarithm of a product equals the sum of the logarithms). Why do we do this? You don’t have to, but it’s strongly advisable because likelihoods, where a lot of small probabilities are multiplied, can get ridiculously small pretty fast (something like 10^-34). At some stage, computer programs are getting into numerical rounding or underflow problems then. So, bottom-line: when you program something with likelihoods, always use logarithms!!!

Defining the prior

As a second step, as always in Bayesian statistics, we have to specify a prior distribution for each parameter. To make it easy, I used uniform distributions and normal distributions for all three parameters. [Some additional information for the “professionals”, skip this when you don’t understand what I’m talking about: while this choice can be considered pretty “uninformative” for the slope and intercept parameters, it is not really uninformative for the standard deviations. An uninformative prior for the latter would usually be scale with 1/sigma (if you want to understand the reason, see here). This stuff is important when you seriously dive into Bayesian statistics, but I didn’t want to make the code more confusing here.]


# Prior distribution
prior = function(param){
	a = param[1]
	b = param[2]
	sd = param[3]
	aprior = dunif(a, min=0, max=10, log = T)
	bprior = dnorm(b, sd = 5, log = T)
	sdprior = dunif(sd, min=0, max=30, log = T)
	return(aprior+bprior+sdprior)
}

The posterior

The product of prior and likelihood is the actual quantity the MCMC will be working on. This function is called the posterior (or to be exact, it’s called the posterior after it’s normalized, which the MCMC will do for us, but let’s not be picky for the moment). Again, here we work with the sum because we work with logarithms.


posterior = function(param){
   return (likelihood(param) + prior(param))
}

Figure: A visualization of the Metropolis-Hastings MCMC, from Hartig et al., 2011. (copyright see publisher)

The MCMC

Now, here comes the actual Metropolis-Hastings algorithm. One of the most frequent applications of this algorithm (as in this example) is sampling from the posterior density in Bayesian statistics. In principle, however, the algorithm may be used to sample from any integrable function. So, the aim of this algorithm is to jump around in parameter space, but in a way that the probability to be at a point is proportional to the function we sample from (this is usually called the target function). In our case this is the posterior defined above.

This is achieved by

Starting at a random parameter value
Choosing a new parameter value close to the old value based on some probability density that is called the proposal function
Jumping to this new point with a probability p(new)/p(old), where p is the target function, and p>1 means jumping as well

It’s fun to think about why that works, but for the moment I can assure you it does – when we run this algorithm, distribution of the parameters it visits converges to the target distribution p. So, let’s get this in R:


######## Metropolis algorithm ################

proposalfunction = function(param){
	return(rnorm(3,mean = param, sd= c(0.1,0.5,0.3)))
}

run_metropolis_MCMC = function(startvalue, iterations){
	chain = array(dim = c(iterations+1,3))
	chain[1,] = startvalue
	for (i in 1:iterations){
		proposal = proposalfunction(chain[i,])

		probab = exp(posterior(proposal) - posterior(chain[i,]))
		if (runif(1) < probab){
			chain[i+1,] = proposal
		}else{
			chain[i+1,] = chain[i,]
		}
	}
	return(chain)
}

startvalue = c(4,0,10)
chain = run_metropolis_MCMC(startvalue, 10000)

burnIn = 5000
acceptance = 1-mean(duplicated(chain[-(1:burnIn),]))

Again, working with the logarithms of the posterior might be a bit confusing at first, in particular when you look at the line where the acceptance probability is calculated (probab = exp(posterior(proposal) – posterior(chain[i,]))). To understand why we do this, note that p1/p2 = exp[log(p1)-log(p2)].

The first steps of the algorithm may be biased by the initial value, and are therefore usually discarded for the further analysis (burn-in time). An interesting output to look at is the acceptance rate: how often was a proposal rejected by the metropolis-hastings acceptance criterion? The acceptance rate can be influenced by the proposal function: generally, the closer the proposals are, the larger the acceptance rate. Very high acceptance rates, however, are usually not beneficial: this means that the algorithms is “staying” at the same point, which results in a suboptimal probing of the parameter space (mixing). It can be shown that acceptance rates between 20% and 30% are optimal for typical applications (more on that here).

Finally, we can plot the results. There are more elegant ways of plotting this which I discuss in another recent post, so check this out, but for the moment I didn’t want to use any packages, so we do it the hard way:


### Summary: #######################

par(mfrow = c(2,3))
hist(chain[-(1:burnIn),1],nclass=30, , main="Posterior of a", xlab="True value = red line" )
abline(v = mean(chain[-(1:burnIn),1]))
abline(v = trueA, col="red" )
hist(chain[-(1:burnIn),2],nclass=30, main="Posterior of b", xlab="True value = red line")
abline(v = mean(chain[-(1:burnIn),2]))
abline(v = trueB, col="red" )
hist(chain[-(1:burnIn),3],nclass=30, main="Posterior of sd", xlab="True value = red line")
abline(v = mean(chain[-(1:burnIn),3]) )
abline(v = trueSd, col="red" )
plot(chain[-(1:burnIn),1], type = "l", xlab="True value = red line" , main = "Chain values of a", )
abline(h = trueA, col="red" )
plot(chain[-(1:burnIn),2], type = "l", xlab="True value = red line" , main = "Chain values of b", )
abline(h = trueB, col="red" )
plot(chain[-(1:burnIn),3], type = "l", xlab="True value = red line" , main = "Chain values of sd", )
abline(h = trueSd, col="red" )

# for comparison:
summary(lm(y~x))

The resulting plots should look something like the plot below. You see that we retrieve more or less the original parameters that were used to create our data, and you also see that we get a certain area around the highest posterior values that also have some support by the data, which is the Bayesian equivalent of confidence intervals.

Figure: The upper row shows posterior estimates for slope (a), intercept (b) and standard deviation of the error (sd). The lower row shows the Markov Chain of parameter values.

References for further readingGelman, A.; Carlin, J. B.; Stern, H. S. & Rubin, D. B. (2003) Bayesian Data Analysis

Andrieu, C.; de Freitas, N.; Doucet, A. & Jordan, M. I. (2003) An introduction to MCMC for machine learning Mach. Learning, Springer, 50, 5-43

Hartig, F.; Calabrese, J. M.; Reineking, B.; Wiegand, T. & Huth, A. (2011) Statistical inference for stochastic simulation models – theory and application Ecol. Lett., 14, 816–827.

95 thoughts on “A simple Metropolis-Hastings MCMC in R”

Pingback: Setting up OpenBugs and R « theoretical ecology
Pingback: MCMC chain analysis and convergence diagnostics with coda in R « theoretical ecology
Anna Kostikova says:

April 11, 2012 at 10:32 am

Thank you very much for an excellent blog post about MCMC. I have noticed, however, a bit strange thing in my opinion: in the “prior” function you return a sum, not a vector, of obtained value (return(aprior+bprior+sdprior)). Should not that be a vector? otherwise how else would you call again this function when you do sampling?
Anna
PhD student in evolutionary biology

LikeLike

Reply
- Florian Hartig says:
  
  April 11, 2012 at 10:53 am
  
  Hi Anna,
  
  glad you like the post.
  
  Regarding your question: in principle, I could also return a vector with the prior probability for each parameter separaetely, but later those would be summed up anyways.
  
  The posterior (which is what we are working on with the MCMC) is given by
  
  P(Parameters|Data) = L(Data|Parameters) * Prior(Parameters)
  
  where Prior(Parameters) is the prior probability that a certain combination of parameters will occurr together. As we assume here that the prior probabilities for each parameter are independent, Prior(Parameters) = Prior(Parameter1) * Prior(Parameter1) * … , or if you work with the log of the probabilities, it’s the sum – this is why I return the sum here, it’s basically already the joint prior for a certain combination of parameters to occurr together based on the individual priors we assumed.
  
  Cheers,
  Florian
  
  LikeLike
  
  Reply
Thomas says:

July 5, 2012 at 12:48 pm

Very nice post, it really helped me to understand the inner workings of an MCMC algorithm. Nice Work!

LikeLiked by 1 person

Reply
Florian Hartig says:

July 5, 2012 at 2:10 pm

Thanks, I’m glad you found it helpful. Got me motivated to go over it another time and try add a few additional pieces of explanation at some points.

LikeLike

Reply
Xuewen Liu says:

July 7, 2012 at 2:27 am

Wonderful post, it helped me a lot. But I have a question. In the section “Defining the prior”, you said “To make it easy, I used uniform distributions for all three parameters, but not that for the standard deviation”. For paramter “b”, you define bprior = dnorm(b, sd = 5, log = T) in the prior function. I’m confused.

LikeLike

Reply
- Florian Hartig says:
  
  July 7, 2012 at 9:25 am
  
  You’re completely right, I somehow hadn’t looked properly at the code when I wrote this text, thanks for spotting this, I corrected the text.
  
  LikeLike
  
  Reply
- Florian Hartig says:
  
  July 7, 2012 at 9:45 am
  
  I also added a bit more explanation on the point about uniform priors for sigma not being uniformative, I think this was also confusing before, hope it’s better understandable now!
  
  LikeLike
  
  Reply
  - Xuewen Liu says:
    
    July 8, 2012 at 2:00 am
    
    Thank you so much! This simple example showed and helped me understand how MCMC works.
    
    LikeLike
    
    Reply
Pingback: A simple Approximate Bayesian Computation MCMC (ABC-MCMC) in R « theoretical ecology
ifort says:

August 8, 2012 at 1:25 pm

Reblogged this on My Notepad.

LikeLike

Reply
johan says:

August 10, 2012 at 1:48 pm

great post. It helped me a lot, but i have a question. How can we use mcmc when we want to estimate the parameters of a multiple regression with 4 variables for example?

LikeLike

Reply
- Florian Hartig says:
  
  August 10, 2012 at 5:34 pm
  
  Just modify the model in the likelihood function (where it says now pred = a*x + b) and modify the code to accommodate the additional parameters where appropriate.
  
  However, this example was meant for teaching how an MCMC works, if you seriously want to fit more complicated models, use OpenBugs or Jags or such alike, e.g. https://theoreticalecology.wordpress.com/2011/07/20/setting-up-openbugs-and-r/ – this is much more efficient because of faster implementation and proposal adjustment.
  
  LikeLike
  
  Reply
- Florian Hartig says:
  
  April 29, 2015 at 9:26 am
  
  This question also came up at http://stats.stackexchange.com/questions/147600/metropolis-hastings-mcmc-for-bayesian-regression-in-r/147691
  
  LikeLike
  
  Reply
Wilson Chen says:

August 27, 2012 at 1:43 pm

Hi Florian,

Thank you very much for a very clear introduction on MCMC. However, after playing around with your code, I have noticed a problem:

The performance of the algorithm seems to get progressively worse as the sample size increases. E.g. sampleSize <- 301 and sampleSize <- 3001. The acceptance rate becomes extremely low in the latter case. As the log-likelihoods become large negative numbers, so will their differences (i.e., posterior(proposal) – posterior(chain[i,]) is a very large negative number for most proposed steps).

When sample size is large, it seems to be difficult to get the chains to mix well again by reducing the variances of the proposal distributions.

Am I missing something, or is this a common problem? Is there a way to get around this problem, or is it a matter of find tuning the proposal variances?

Best regards,
Wilson

LikeLike

Reply
- Florian Hartig says:
  
  August 27, 2012 at 2:23 pm
  
  Hi Wilson,
  
  your observation is correct. The reason is that if you have more data, the posterior gets more narrow. As a result of that, the proposal function that I suggested is too wide and produces lots of rejections – adjust it and it will get better (professional samplers like JAGS do this automatically for you). I explain this in detail in a follow-up post https://theoreticalecology.wordpress.com/2011/12/09/mcmc-chain-analysis-and-convergence-diagnostics-with-coda-in-r/ . Also, note that with so much data, you initial guess might be much worse (in terms of relative goodness of fit) than before, so you might need to use a longer burn-in time. As a third comment, using that amount of data, it becomes very clear now that a flat prior is not uninformative, i.e. it carries a preference towards larger values for the standard deviation. If you want a more uninformative prior, a 1/sd prior scaling for the standard deviation would probably correct that and lead asymptotically to a better retrieval of the original parameters.
  
  LikeLike
  
  Reply
  - Wilson Chen says:
    
    August 27, 2012 at 4:07 pm
    
    That makes so much sense now! Thanks again Florian. There is so much to learn in computational Bayesian, and it’s so interesting!
    
    LikeLike
    
    Reply
    - Florian Hartig says:
      
      August 27, 2012 at 11:16 pm
      
      Glad you found it helpful. F
      
      LikeLike
      
      Reply
Pingback: Olhando o MCMC mais de perto… « Recologia
Ngesa says:

April 6, 2013 at 5:06 pm

Hello Florian, the explanation is superb and very helpful!!! I have one question though, i was trying to use the same code to solve another(similar) simple regression question. I run into trouble especially in the proposal function. I suspect it is in the choice of the std deviations. In your example above, how did you choose the sd values sd= c(0.1,0.5,0.3), in the proposal function? Thank you in advance.

LikeLike

Reply
- Florian Hartig says:
  
  April 6, 2013 at 5:25 pm
  
  Hi Ngesa, first have a look at the last section of this post https://theoreticalecology.wordpress.com/2011/12/09/mcmc-chain-analysis-and-convergence-diagnostics-with-coda-in-r/ which explains the basic problem.
  
  Basically, you want your proposal to fit to the shape of the posterior you sample from – very brief, one could say that for unimodal posterior distributions, a proposal with a similar shape as the posterior (including correlations between parameters), but a bit shrunk as compared to the posterior usually works well.
  
  How much you should shrink depends on the number of parameters you vary in the proposal, the shape of the posterior and the how well your proposal mimics that shape – typically you will have to adjust the proposal after running your sampler for the first time (that is all right as long as you discard all samples during adjustment). As an initial guess (don’t quote me for that) I would start for a typical regression problem with 3-8 parameters and all parameters varied at once with 1/10th of the posterior uncertainty that I expect, but again, this depends a lot of how much parameters you have and how much correlation I would expect.
  
  For more details see
  
  Gelman, Andrew, et al. Bayesian data analysis. Chapman & Hall/CRC, 2004.
  
  Andrieu, Christophe, and Johannes Thoms. “A tutorial on adaptive MCMC.” Statistics and Computing 18.4 (2008): 343-373.
  
  LikeLiked by 1 person
  
  Reply
Samarth says:

October 9, 2013 at 8:47 pm

I was wondering what if instead of log =T, I substitute log = F. How does that affect the over all results?
If you can please explain then I would greatly appreciate it.
Thank you

LikeLike

Reply
- Florian Hartig says:
  
  October 9, 2013 at 9:39 pm
  
  In principle it’s the same of course, it’s simply that computers have problems storing small numbers, as explained in the paragraph “why we work with likelihoods”
  
  Try it out, if you create sufficiently large data, you will get rounding or underflow problems when working with Log = F (of course you have to update the formulas, e.g. instead of probab = exp(posterior(proposal) – posterior(chain[i,])) you sould have probab = posterior(proposal) / posterior(chain[i,]). We sometimes do this as an exercise in our courses.
  
  LikeLike
  
  Reply
  - Samarth says:
    
    April 23, 2014 at 2:39 pm
    
    Thank you, For your insight. The probab = exp(likelihood(proposal[i])+ prior(proposal[i]) – likelihood(chain[i,])- prior(chain[i,])). I understand that the likelihood function is used for determining the model performance. What is the purpose of using the prior.
    
    LikeLike
    
    Reply
    - Florian Hartig says:
      
      April 23, 2014 at 4:44 pm
      
      We explain the motivation in http://ecologypapers.blog.com/files/2012/10/Hartig-Connectingdynamicvegetation-2012.pdf , in the section on “INFERRING PARAMETERS AND PROCESSES BY INVERSE MODELLING”.
      
      LikeLike
      
      Reply
drill says:

December 16, 2013 at 10:23 pm

this may be a basic question, but in the above you call a the intercept. isn’t it the slope of the regression line? forgive me if there is something obvious i am missing.

LikeLike

Reply
- Florian Hartig says:
  
  December 17, 2013 at 1:09 pm
  
  you’re right, I wrote intercept but plotted the likelihood for the slope, my bad, I’ve corrected the code. Thanks for the hint.
  
  LikeLike
  
  Reply
  - drill says:
    
    December 17, 2013 at 3:36 pm
    
    thanks. minor point but it tripped me up for a bit. thanks so much for this post. i am working through it and it has already been very helpful!
    
    LikeLike
    
    Reply
Pingback: MCMC for maximum likelihood in R | Trees 'r' Stats
Pingback: Running Bayesian models | Sam Clifford
Larry Yang says:

October 20, 2014 at 5:13 pm

Super helpful post !
I have been reading books, chapters, presentations, and papers about MCMC in the last 2 months. This post is just in time to clear up my cloudy neural network in my brain for MCMC.

LikeLike

Reply
Pingback: MCMC | Letian's Blog
Paúl A Carrillo Maldonado says:

December 9, 2014 at 2:52 pm

Hi Florian

I run your script, but no show the web result. I tried any times, but not solve. I have set my seed?

I share the plot https://dl.dropboxusercontent.com/u/109431545/plot_hm.pdf

Thanks for you help me.

Paúl A Carrillo Maldonado

LikeLike

Reply
- Florian Hartig says:
  
  December 9, 2014 at 4:36 pm
  
  Looks roughly the same to me, except that you have a different size of your plot window, this may depend on your OS / editor / output device.
  
  I would say everything is fine. Small differences of the results are expected of course, because, as you say, we haven’t fixed the random seed, so each time we are running this we have a slightly different dataset.
  
  LikeLike
  
  Reply
Pingback: A simple explanation of rejection sampling in R | theoretical ecology
Pingback: Making the Most of MCMC | Sweet Tea, Science
Watheq J. Al-Mudhafar says:

November 29, 2015 at 5:00 pm

Thank you Hartig for you exceptional help to understand MCMC.

LikeLike

Reply
- Florian Hartig says:
  
  November 30, 2015 at 4:33 pm
  
  Thanks, I’m glad it was useful to you!
  
  LikeLike
  
  Reply
Fabio says:

December 9, 2015 at 5:09 pm

Thanks a lot for this very helpful post. I have just a doubt. Is it normal than the prior distributions are not updated through the iterations of the chain?
Thanks again
Fabio

LikeLike

Reply
- Florian Hartig says:
  
  December 9, 2015 at 5:19 pm
  
  The priors are not updated (per definition, the prior is fixed), but it influences the development of the chain through the posterior function (remember the posterior contains the prior as one part) – in that sense, the prior is considered in the updates.
  
  LikeLike
  
  Reply
  - Fabio says:
    
    December 9, 2015 at 5:25 pm
    
    Thanks for the fast replay.
    Fabio
    
    LikeLike
    
    Reply
haolia says:

January 17, 2016 at 11:12 am

Amazing, great post, is it possible to re-calculate it by excel so it would make easy to understand for those who unfamiliar with R.
thaks
Haolia

LikeLike

Reply
- Florian Hartig says:
  
  January 17, 2016 at 1:21 pm
  
  Hi Haolia, glad you liked the post! About Excel – I guess it would be possible, but I have to admit I don’t see the point. Probably I would have to use Visual Basic to do this in Excel, and if you understand VBA, you should also be able to read R code. In general, if you want to do serious stats, the best thing to do is to stop using Excel for anything else than data entering, and get used to R.
  
  LikeLike
  
  Reply
  - haolia says:
    
    January 26, 2016 at 7:08 am
    
    Dear Florian, finally i can make the excel version.. very inspired.. thank you,
    Haolia
    
    LikeLike
    
    Reply
songuke says:

April 6, 2016 at 5:12 am

Great tutorial on how to use MCMC to estimate a posterior distribution. Thanks a lot!

LikeLike

Reply
John says:

July 27, 2016 at 10:56 pm

Great post. Really cleared things up for me. Question, however. If a non-symmetric proposal function was used a “correction factor” would be required, correct? I’m familiar with how to do it for a single parameter but, if required, how would it be implemented in the multi-parameter case you’ve described? Thanks.

LikeLike

Reply
- Florian Hartig says:
  
  August 1, 2016 at 5:22 pm
  
  Yes, if you have a non-symmetric proposal distribution g(x’|x) for going from x to x’, you will have to correct the Metropolis ratio with g(x’|x) / g(x|x’). This works for any number of dimensions.
  
  In practice, this is rarely used though. Most situations where a non-symmetric proposal would be useful can better be dealt with by applying a transformation on the parameter space.
  
  LikeLike
  
  Reply
Jerome Cao says:

October 8, 2016 at 9:46 pm

Hello, that’s great tutorial and sample. Thank you! But in the likelihood function, you are using the truth data x as the prediction. What should I use if x is unknown? I only have data y, and know that y is generated from an unkown x. Thanks.

LikeLike

Reply
- Florian Hartig says:
  
  October 12, 2016 at 2:58 pm
  
  Hi Jerome – in a regression, the assumption is that you know both the observed response (y) and the predictor (x), and what you want to estimate is the unknown effect of x on y.
  
  The problem you have seems different, but I’m afraid you will have to be a bit more specific about what you want to achieve to get sensible help.
  
  If you want to do something that is completely different from a regression, I would suggest that a forum such as stats.stackexchange.com/ is a more suitable place to get help than this post.
  
  LikeLike
  
  Reply
Guoguo Z. says:

November 22, 2016 at 4:32 am

Very helpful post! I read several articles and book chapters on MCMC, but this post makes it more concrete for me. Can you help me with a question though? Why didn’t you update each parameter separately? Thank you!

LikeLike

Reply
- Florian Hartig says:
  
  November 22, 2016 at 12:53 pm
  
  Glad this was useful to you Guoguo.
  
  About your question – sure, one can update one parameter at a time. This algorithm is called Metropolis-within-Gibbs. It is one of the many extensions of the standard Metropolis that aim at improving the speed of convergence for higher-dimensional more complex target distributions. For low-dimensional normal targets it will converge slightly slower though.
  
  LikeLike
  
  Reply
  - Guoguo Z. says:
    
    November 26, 2016 at 3:57 am
    
    Thank you. Dr. Hartig!
    
    LikeLike
    
    Reply
Vianney says:

December 2, 2016 at 3:21 pm

Oh my GOD!! This is an extremely well executed post! Thank you very much to make it so clear! This is the exact example of how to be a teacher.

LikeLike

Reply
mabr says:

December 29, 2016 at 9:14 am

Hi,

thanks a lot for very helpful code. I just have one question. Which parts of the code except number of parameters and size of the matrix should be changed in order to make it suitable for estimating just one parameter?

LikeLike

Reply
- Florian Hartig says:
  
  December 29, 2016 at 7:34 pm
  
  Hi mabr, this should be easy enough – have you tried if you can manage yourself? If you need help, the post by Darren Wilkinson that I link to above is a 1-d example.
  
  LikeLike
  
  Reply
northcoastratRussell says:

February 3, 2017 at 7:35 pm

I’m not sure this is a Metropolis-Hastings implementation rather than Metropolis only. The former is for asymmetric distributions and includes the marginal probabilities in the accept/reject criteria ratio. In this case, because you are using symmetric distributions (normal and uniform) they are equivalent (I think anyway), but if you were to include asymmetric distributions (e.g. beta, gamma, lognormal) then this would not be a Metropolis-Hastings example.

Your line of code below does not include the marginal probabilities.
probab = exp(posterior(proposal) – posterior(chain[i,]))

See Hobbs and Hooten 2015 Ch. 7. Please correct me if I’m wrong.

LikeLike

Reply
- Florian Hartig says:
  
  February 3, 2017 at 10:28 pm
  
  I guess that’s true – I didn’t code the correction term because I used a symmetric proposal (in this case, the correction is always 1), but I guess it might be more pedagogical to add it, or change the name.
  
  Note, however, that the correction works with the proposal distribution – you seem to refer to the prior distribution, the prior is accounted for here.
  
  I didn’t get what you mean by the “Your line of code below does not include the marginal probabilities. …”
  
  LikeLike
  
  Reply
mobze says:

April 19, 2017 at 9:52 pm

Excellent post with nice responses! Many of my questions were answered in the discussion below!

However, I still don’t understand how you chose sd = c(0.1,0.5,0.3) in the proposal. I understand that you will adapt this based on your expectations on the uncertainty you expect:

“As an initial guess (don’t quote me for that) [sorry I quoted you here…] I would start for a typical regression problem with 3-8 parameters and all parameters varied at once with 1/10th of the posterior uncertainty that I expect”

How should I “guess” those values and how do you get a posterior uncertainty?

Also, why are you comparing the probability of acceptance with a random uniform number (between 0 and 1)? Is it just a convenience in the coding? I bet that in JAGS or STAN it’s way more complex than that!

if (runif(1) < probab){ # If the probability we get is higher than a random probability
chain[i+1,] = proposal
} else {
chain[i+1,] = chain[i,]
}

Thanks again!

LikeLike

Reply
- Florian Hartig says:
  
  April 19, 2017 at 10:16 pm
  
  About the guessing – I often have a rough idea, but if you don’t know, would start with 1/20 of the prior or so.
  
  About the runif(1) < probab – yes, this is one option to code a draw from a Bernoulli trial, another option would be using rbinom(). As for any random number generation, there are many algorithms to do the job, and optimizing this is quite an art. I'm simply using this because this is a standard way to code this, and I wouldn't be surprised to see this code also in other software. I doubt this is a runtime-critical part of the code for any sampler.
  
  LikeLike
  
  Reply
Brent Mast says:

May 2, 2017 at 3:44 pm

Hi. You use a different seed for every iteration of the random uniform. I’m wondering why you didn’t use the same seed for each random variate. Thanks, Brent

LikeLike

Reply
- Florian Hartig says:
  
  May 2, 2017 at 4:32 pm
  
  Where do you see a different seed?
  
  LikeLike
  
  Reply
Jerry says:

May 10, 2017 at 4:09 am

Thank you for this concrete example of the code underlying MCMC. It was extremely helpful. It took me a while to “unpack” and understand your likelihood and prior functions, since I am not fully conversant with functions in R, but it was definitely worth the trouble.

LikeLike

Reply
JD says:

June 14, 2017 at 1:30 pm

Thank you for this example, extremely useful for a newbie to R and MCMC. Could I ask how you would use this type of model practically.

In my mind we could use MCMC to see reasonable acceptable results for y given the same x.

To that end, how could we use our chain results to produce a sequence of y? Or would we need to edit the functions to produce a chain of y values?

Please correct me if I have got this wrong!

LikeLike

Reply
- Florian Hartig says:
  
  July 1, 2017 at 5:30 pm
  
  Hi JD, sorry, I hadn’t checked the blog in a while, so this question was on hold forever. Anyway, I’m not 100% sure if I understand what you’re asking, but I suspect what you want to do are posterior predictive distributions (i.e. record the model predictions in the chain), or posterior predictive simulations (i.e. simulate new values based on the posterior predictive values in the chain). By chance, I have just written a post on residual checks via posterior predictive simulations, so maybe this helps https://theoreticalecology.wordpress.com/2017/07/01/bayesian-model-checking-via-posterior-predictive-simulations-bayesian-p-values-with-the-dharma-package/
  
  LikeLike
  
  Reply
MM says:

September 4, 2017 at 2:27 pm

Thanks, How can I add thinnng interval=3

LikeLike

Reply
- Florian Hartig says:
  
  September 30, 2017 at 3:12 pm
  
  Just record every third iterations of the MCMC in the chain. Or run everything, and then extract every third row of returned matrix.
  
  LikeLike
  
  Reply
Pingback: Metropolis Hastings algorithm for double sigmoidal curve – program faq
Jon Colado says:

September 19, 2018 at 4:24 pm

Terrific article. It’s so helpful to code an MCMC from scratch to really understand how it works.

Sorry to be a code nit, but you should really pass the y vector as an argument into the likelihood function. It took me a while to figure out what y was within that function. (Also, x is used globally but then again locally for slopevalues, which was a touch confusing. And “sd” is already a base R function – consider calling it stdev or something.)

The explanation of the mathematical process is really clear. Kudos on a great article.

LikeLike

Reply
- Florian Hartig says:
  
  March 1, 2019 at 8:45 am
  
  Hi Jon, thanks for the feedback, and sorry for a very late reply. About making L a function of D – we had the same discussion extensively for the https://github.com/florianhartig/BayesianTools package. The issue with doing this is that you then have to pass on the data through the MCMC as well, which I find creates more issues than solutions.
  
  You’re right about sd, should change that!
  
  LikeLike
  
  Reply
Pingback: #ENVR2018: A Recap – Sweet Tea, Science
jacob says:

February 28, 2019 at 11:19 pm

Hi, i’m fairly new to Bayesian and R, but i’m very studious in an attempt to climb this learning curve. Do you know of any great conceptual/intuitive books/video series that help teach R as it relates to bayesian?

LikeLike

Reply
- Florian Hartig says:
  
  March 1, 2019 at 8:43 am
  
  Hi Jacob, see at the end of here http://florianhartig.github.io/LearningBayes/
  
  LikeLike
  
  Reply
Caroline Mburu says:

May 30, 2019 at 4:23 pm

Thank you very much for this example.I wonder if you have any links to a simplified MCMC for deterministic SIR models?Where you are estimating gamma and beta?

LikeLike

Reply
- Florian Hartig says:
  
  May 30, 2019 at 6:55 pm
  
  Hi Caroline, I have no example at hand, but it’s a trivial change, you would just have to replace the a*x + b in the likelihood above with your SIR model, e.g. the one from this example http://rstudio-pubs-static.s3.amazonaws.com/6852_c59c5a2e8ea3456abbeb017185de603e.html
  
  However, if you go in fitting this type of models, I would recommend using https://cran.r-project.org/web/packages/BayesianTools/index.html , which has much more efficient MCMC samplers for complex dynamic models. Just replace the example model in the vignette with an SIR model such as the one above. If you have problems doing this, you can post a question at https://github.com/florianhartig/BayesianTools/issues
  
  LikeLike
  
  Reply
Caroline Mburu says:

May 31, 2019 at 6:46 am

Thank you very much for the quick response.Let me have a look at the links

LikeLike

Reply
Sae says:

August 30, 2019 at 9:47 pm

Thank you for the great example! One question: Could you use a dirichlet prior for any other parameters?

LikeLike

Reply
- Florian Hartig says:
  
  September 5, 2019 at 7:19 am
  
  Hi Sae! For a linear regression, a Dirichlet would be uncommon, but in principle you could of course replace the current prior by a Dirichlet. If you go into this sort of problems for real, you may want to use a more sophisticated sampler though, e.g. the one implemented in https://cran.r-project.org/web/packages/BayesianTools/index.html
  
  LikeLike
  
  Reply
Drinold says:

September 5, 2019 at 10:05 pm

thanks for great in debt on mcmc

LikeLike

Reply
akyp says:

October 15, 2019 at 10:13 pm

How about having only a prior of 1/sigma^2 for ε? I tried to make the algorithm work only using one prior on sigma, and I could not make it converge…

LikeLike

Reply
- Florian Hartig says:
  
  December 10, 2019 at 12:12 pm
  
  Sorry for the late reply – I assume you want to implement Jeffrey’s prior, right? In principle, that should be possible.
  
  LikeLike
  
  Reply
James Thomson says:

May 6, 2020 at 3:08 pm

I’ve been trying to get my head around Bayesian MCMC regression for a while. I’d grasped the basics of bayesian MCMC, but was struggling to apply this to regression. The following sentence really filled in the missing piece:

we have to return the probability of obtaining the test data above under this model (this sounds more complicated as it is, as you see in the code, we simply calculate the difference between predictions y = b + a*x and the observed y, and then we have to look up the probability densities (using dnorm) for such deviations to occur.

Thank you for your useful post!

LikeLike

Reply
Paul Yu says:

June 5, 2020 at 10:54 pm

Hi thank you for this incredible work, it does help me figure out the what’s underneath the hood of MCMC.

LikeLike

Reply
Pingback: A Slightly More Advanced MCMC Example – Data Science Austria
Pingback: A Slightly More Advanced MCMC Example | R-bloggers
Vitor Hugo Moreau says:

July 29, 2020 at 12:47 pm

Thanks for the excellent text. I suggest you to add “na.rm=T” at the line 8 of the loglikelihood script, in: “sumll = sum(singlelikelihoods, na.rm=T)”. I had some problems with NAs generation in my data. As well, I suggest also to remove the extra space between commas in the line 4 of the graphic script: “hist(chain[-(1:burnIn),1],nclass=30, , main=”Posterior of a”, xlab=”True value = red line” )”. It gives a warning message.
I would like to ask you if I can use part of your script in my paper and how I have to refer you text. Is there a public paper with it? Or I can cite this page. Thank you.

LikeLike

Reply
- Florian Hartig says:
  
  March 23, 2021 at 1:48 pm
  
  Hi Victor,
  
  thanks for your comments and sorry for the late reply, I somehow missed this comment.
  
  Of course you can use this code for the paper, but may I suggest that the sampler we implement here https://cran.r-project.org/web/packages/BayesianTools/index.html would be more powerful and also stable. In these samplers, we also check for NAs etc. The code here is just meant as a demo.
  
  For the same reason, I’d prefer to keep the code as it is, it is really just a demo to understand the principles, not to be adjusted for all kinds of models (in which case I recommend to use the more “professional” samplers implemented in various R packages).
  
  LikeLike
  
  Reply
Brian Davis says:

July 29, 2020 at 8:52 pm

Great post; Helping folks understand the black box of MCMC since 2012!

Curious…why not remove duplicate values from chain output? In other words, what’s the motivation for repeating parameters when proposals are not accepted?

LikeLike

Reply
- Florian Hartig says:
  
  March 23, 2021 at 1:41 pm
  
  Hi Brian, sorry, I had somehow missed this comment – in MCMC sampling, you want to get an idea about the shape of the target distribution. In the MH sampler, the point is that you will stay at good values longer until you accept a proposal, thus adding more of this “good” values (as duplicates) to the chain. This is why these good values will show up as more prominent in the final plots.
  
  LikeLike
  
  Reply
Trevor says:

March 3, 2021 at 3:17 pm

Great Post! You explained very well how the likelihood and prior distributions are used. One thing I’m not entirely clear on is the where the p(data) term in the denominator of Bayes equation is used?

LikeLike

Reply
- Florian Hartig says:
  
  March 3, 2021 at 6:30 pm
  
  Hi Trevor, p(data) can be expanded as p(data) = int dx p(data|x) d(x), and doing this we see that it is identical to the integral over the posterior. In other words, p(data) is just the normalization constant that is calculated when transforming the posterior samples into a density estimate.
  
  LikeLike
  
  Reply
Aditya says:

November 4, 2022 at 6:53 am

Great Post!

Did you use stan in any post for Bayesian data analytics on given same example?

If then please share that post.

LikeLike

Reply
- Florian Hartig says:
  
  December 1, 2022 at 8:22 pm
  
  This stuff is older than Stan but yes, I should add some Stan updates!
  
  LikeLike
  
  Reply
Isabella Stevens says:

December 29, 2022 at 6:42 pm

Hi there, I’m trying to do Metropolis Hastings on a time series of weather, giving a model of Y_t = Y_t-1 + N(0,1). Is this adaptable to that? Can I use the data I have to forecast future data using this method?

LikeLike

Reply
- Florian Hartig says:
  
  January 2, 2023 at 8:47 am
  
  Hello Isabella, yes, you can fit time series / state space models with MCMC, but for numerical efficiency, one would usually prefer sequential filters, such as the Kalman filter.
  
  LikeLike
  
  Reply

	Pavel Dodonov on Time to leave “species richnes…
	Friday links: long l… on Time to leave “species richnes…
	Donald Schoolmaster on Time to leave “species richnes…
	Florian Hartig on The end of the bias-variance t…
	David Hearn on The BayesianTools R package wi…
	Florian Hartig on The BayesianTools R package wi…
	David Hearn on The BayesianTools R package wi…

theoretical ecology

notes from ecology, evolution and statistics by Florian Hartig

A simple Metropolis-Hastings MCMC in R

95 thoughts on “A simple Metropolis-Hastings MCMC in R”

Leave a comment Cancel reply

Share this:

Related posts

95 thoughts on “A simple Metropolis-Hastings MCMC in R”

Leave a comment Cancel reply