Phylogenetic Tools for Comparative Biology: Bug fix in make.simmap(...,Q="mcmc")

Thursday, May 30, 2013

Bug fix in make.simmap(...,Q="mcmc")

I just posted a new version of make.simmap and a new phytools build (phytools 0.2-77). This version fixes a bug affecting make.simmap(...,Q="mcmc"). In this method, the transition matrix Q is sampled from its Bayesian posterior probability distribution using MCMC given the model & data. This sample is then used by make.simmap to map characters on the tree.

The bug was not in the MCMC itself, which (so far as I can tell) is properly designed, but in how Q was stored in sampling generations. Specifically, the proposed (updated) value of Q for that generation was always stored, whether or not the proposal had been accepted. This is not a bug in the MCMC proper because the proposed value was only returned to the chain with probability equal to the posterior odds ratio (or 1, if the ratio exceeds 1). The problem thus affected only the value that was stored from the chain, not the behavior of the chain itself. This was somewhat difficult to detect because it is not obvious unless the variance of the proposal distribution is high relative to the curvature of the likelihood surface or the prior density. (In case it's not obvious why: if the proposal variance is low, most post burn-in proposals will have relatively high posterior odds and thus a good chance of being accepted, so the stored values differ little from the accepted ones; whereas if the proposal variance is high, most proposals will have low posterior odds and will be rejected, yet they were stored anyway.)
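To make the distinction concrete, here is a minimal toy sketch in base R. It is not the make.simmap code: the target (a gamma density for a single rate q with mean 0.1), the function names log_post and run_chain, and the two proposal variances are all assumptions chosen for illustration. The chain itself is identical in both cases; the only difference is whether we store the state of the chain or the proposal.

```r
## Toy Metropolis sampler for one positive rate q.
## Hypothetical target (NOT the make.simmap posterior): gamma(shape = 3, rate = 30).
set.seed(1)
log_post <- function(q) {
  if (q > 0) dgamma(q, shape = 3, rate = 30, log = TRUE) else -Inf
}

run_chain <- function(ngen, v, q0 = 0.1) {
  q <- q0
  correct <- buggy <- numeric(ngen)
  for (i in seq_len(ngen)) {
    q_new <- rnorm(1, mean = q, sd = sqrt(v))  # symmetric proposal with variance v
    ## accept with probability min(1, posterior odds ratio)
    if (log(runif(1)) < log_post(q_new) - log_post(q)) q <- q_new
    correct[i] <- q      # state of the chain after the accept/reject step
    buggy[i] <- q_new    # the proposal, stored whether or not it was accepted
  }
  list(correct = correct, buggy = buggy)
}

low  <- run_chain(20000, v = 1e-4)  # low proposal variance
high <- run_chain(20000, v = 1)     # high proposal variance

## compare the spread of the stored values under each storage rule
round(c(low_correct = sd(low$correct), low_buggy = sd(low$buggy),
        high_correct = sd(high$correct), high_buggy = sd(high$buggy)), 3)
```

With the low proposal variance the two stored samples should look much alike, which is why the problem was easy to miss; with the high proposal variance the stored proposals are far more dispersed than the chain and include negative (impossible) rates.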

I also changed the starting value of Q for the MCMC. Previously, I had arbitrarily set all the non-diagonal elements of Q to a fixed constant. Now I draw a set of values at random from the prior probability density on Q, as provided by the user. The advantage of this is that if we set a very strong prior on Q, an MCMC started from an arbitrary constant may have difficulty converging on the region of high posterior density if the variance of our proposal distribution is too low or (especially) too high.
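A minimal sketch of what drawing a starting Q from the prior could look like is given below. The details are assumptions made for illustration and are not taken from the make.simmap source: three states, a gamma prior with shape alpha and rate beta, and an independent draw for every off-diagonal element.

```r
## Draw a starting Q from a gamma prior on the off-diagonal rates.
## Assumed for illustration only: k = 3 states, shape alpha = 1, rate beta = 2.
set.seed(1)
k <- 3
alpha <- 1
beta <- 2
Q0 <- matrix(rgamma(k * k, shape = alpha, rate = beta), k, k)
diag(Q0) <- 0
diag(Q0) <- -rowSums(Q0)  # each row of a transition rate matrix sums to zero
Q0
```

Because the draw comes from the prior, a strong prior automatically places the chain's starting point in a region that the prior considers plausible.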

I'm not sure what a good proposal variance is - but one way of thinking about it is relative to the empirical Q. For instance, if the non-diagonal elements of our empirical Q are all around 0.1, then it is probably not a good idea to set vQ = 10. Unless our data contain very little information about Q, almost all proposals will be rejected and the MCMC will be very inefficient at exploring the posterior distribution of Q. Conversely, if the non-diagonal elements of our empirical Q average > 100, then we should probably not choose vQ = 0.001. In this case, if we start anywhere near the ML estimate of Q - and unless we have very little information about Q in our data - almost all proposals will be accepted and the chain will move in tiny steps, which is also a bad way to sample from the posterior using MCMC.
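The two extremes can be illustrated with a toy one-parameter Metropolis sampler in base R. Everything here is an assumption made for the sake of the example: the gamma targets standing in for the posterior of a single rate, the function name accept_rate, and the chain length. The argument v plays the role of vQ.

```r
## Acceptance rate of a toy Metropolis sampler as a function of proposal variance.
## The gamma(shape, rate) target is a hypothetical stand-in for the posterior of one rate.
set.seed(1)
accept_rate <- function(v, shape, rate, ngen = 5000) {
  log_post <- function(q) {
    if (q > 0) dgamma(q, shape = shape, rate = rate, log = TRUE) else -Inf
  }
  q <- shape / rate  # start at the mean of the target
  n_acc <- 0
  for (i in seq_len(ngen)) {
    q_new <- rnorm(1, mean = q, sd = sqrt(v))
    if (log(runif(1)) < log_post(q_new) - log_post(q)) {
      q <- q_new
      n_acc <- n_acc + 1
    }
  }
  n_acc / ngen
}

## rates around 0.1 with a proposal variance of 10
too_wide <- accept_rate(v = 10, shape = 3, rate = 30)
## rates around 100 with a proposal variance of 0.001
too_narrow <- accept_rate(v = 0.001, shape = 3, rate = 0.03)
c(too_wide = too_wide, too_narrow = too_narrow)
```

In the first case the acceptance rate should be close to zero, and in the second close to one; neither chain explores its target efficiently.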

Even though make.simmap is not set up for this, it is possible to run some diagnostics on our MCMC using the MCMC diagnostics package coda. For example, let's say we have obtained 100 samples of Q (and thus 100 stochastically mapped trees) from the posterior after burn-in:

mtrees<-make.simmap(tree,x,nsim=100,Q="mcmc",vQ=0.01,prior=list(use.empirical=TRUE,beta=2))

we can get the likelihoods using

logL<-sapply(unclass(mtrees),function(x) x$logL)

or (for instance), the posterior sample of Q_1,2 using

q12<-sapply(unclass(mtrees),function(x) x$Q[1,2])

and then perform diagnostics (effective size, rejection rate, etc.) using the appropriate coda functions. To increase effective size without increasing the number of sampled trees, we can increase the number of generations between samples (samplefreq) and increase or decrease the proposal variance (vQ), as needed.
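As a rough sketch of the diagnostics step: the block below uses a simulated, autocorrelated series as a stand-in for the q12 vector extracted above, purely so that it runs without a tree or data. The simulated values and the AR(1) coefficient are assumptions; with a real analysis we would simply pass our own q12 (or logL) to coda::mcmc. The coda calls are skipped if the package is not installed.

```r
## Stand-in for q12: 100 autocorrelated values simulated from an AR(1) process.
## With real results, use the q12 vector pulled from the mapped trees instead.
set.seed(1)
q12 <- 0.1 + as.numeric(arima.sim(list(ar = 0.8), n = 100, sd = 0.01))

if (requireNamespace("coda", quietly = TRUE)) {
  chain <- coda::mcmc(q12)
  print(coda::effectiveSize(chain))  # effective sample size
  print(coda::rejectionRate(chain))  # rejection rate computed from the stored samples
  print(coda::autocorr.diag(chain))  # autocorrelation at several lags
}
```

If the effective size is much smaller than the number of samples, the stored values are strongly autocorrelated, which is the situation in which changing samplefreq or vQ should help.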

6 comments:

Liam Revell, May 30, 2013 at 1:06 PM
BTW - thanks to Rich Glor for reporting suspicious results that led to the discovery of this bug. - Liam
MCarvalho, May 30, 2013 at 7:02 PM
Hi Liam,

Do you know of any work/literature on choosing a prior distribution for Q?

Best,

Luiz