In the midst of all the craziness that is March of 2020, for some reason over the past few days I have been trying to improve the phytools function mcmcMk, just to see if I can possibly make it useful for some phytools users. mcmcMk does MCMC for the Mk model - but note that this can also be done using the package diversitree, and (in fact) that is what I would generally recommend. Nonetheless, I have managed to make mcmcMk kind of cool - so it could be worth checking out.
The first thing I had to do was address some issues with the default prior distribution. I had been using an exponential distribution - but this turns out to be a really terrible idea. I have now switched to a Gamma distribution, which works much, much better.
I also (2) let the parsimony score for the trait inform the default prior. In particular, I set the default value of \(\alpha\)/\(\beta\) (the mean of the \(\Gamma\) distribution) to be the parsimony score divided by the total tree length. The default value of \(\alpha\) is (for now) 0.1, which corresponds to a pretty flat distribution (though you'll see that I will use an even lower value of \(\alpha\) here).
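Just to spell out that default (a sketch with made-up stand-in numbers, not values computed from any real tree or dataset): the rate \(\beta\) of the Gamma prior is chosen so that the prior mean \(\alpha/\beta\) equals the parsimony score divided by the sum of the edge lengths.

```r
## sketch of the default Gamma prior; these inputs are made-up stand-ins
parsimony.score<-28     ## hypothetical parsimony score for the character
tree.length<-35         ## hypothetical sum of all the edge lengths
prior.mean<-parsimony.score/tree.length  ## default prior mean for q
alpha<-0.1                               ## default shape: quite flat
beta<-alpha/prior.mean  ## mean of a Gamma(alpha,beta) is alpha/beta
curve(dgamma(x,shape=alpha,rate=beta),0,5,xlab="q",ylab="prior density")
```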
The other thing (3) I was able to accomplish was to speed up the likelihood calculation - which in turn speeds up the MCMC. I had been doing this using fitMk in phytools for a fixed value of the transition matrix Q. I realized that it should run faster if I just modified fitMk to export the likelihood function, and then used this likelihood function in the MCMC calculations.
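To make the gain concrete, here is a toy, self-contained illustration of the closure idea in base R (a Poisson likelihood standing in for the Mk likelihood - this is not phytools code): the data-dependent setup happens once when the function is built, so every call inside the MCMC only recomputes the rate-dependent part.

```r
## toy illustration of exporting a likelihood function (NOT phytools code):
## the constructor does the data-dependent setup a single time, and the
## returned closure reuses it on every call
make.lik<-function(x){
	n<-length(x)
	sx<-sum(x)
	lg<-sum(lgamma(x+1))  ## expensive pieces, computed once
	function(rate) sx*log(rate)-n*rate-lg  ## Poisson log-likelihood
}
lik<-make.lik(rpois(1000,lambda=2))
lik(2)  ## repeated calls like this inside the MCMC are now cheap
```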
In addition to this (4), I also tested geiger::fitDiscrete, which already exported its likelihood function. It turns out that fitDiscrete's likelihood calculation is significantly faster than fitMk's. As such, I have now added this method as the default for computing the likelihood - but users can still use fitMk if they want. (They'll need to, for instance, if they don't have geiger installed or if they want to input their data for x as a probability matrix.)
Finally (5), I added the option (also the default) to auto-tune the proposal distribution to target a user-specified acceptance rate for the MCMC. This value defaults to 0.5, but can be adjusted up or down by the user. (For a one-parameter model, a good target acceptance rate is 0.44; for more complex models it should be lower.)
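The auto-tuning works along these general lines (a minimal sketch of the standard approach - scale the proposal up when you accept too often, down when too rarely - and an assumption on my part, not the literal mcmcMk internals):

```r
## generic proposal auto-tuning sketch (an assumption about the approach,
## not code lifted from mcmcMk): nudge the proposal width toward the
## user-specified target acceptance rate
tune<-function(prop.width,accept.rate,target=0.5,step=0.1){
	prop.width*exp(step*(accept.rate-target))
}
w<-1
w<-tune(w,accept.rate=0.8)  ## accepting too often: take bigger steps
w<-tune(w,accept.rate=0.2)  ## accepting too rarely: take smaller steps
```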
Let's test it out with some simple data:
library(phytools)
packageVersion("phytools")
## [1] '0.7.23'
## here's our data
tree
##
## Phylogenetic tree with 100 tips and 99 internal nodes.
##
## Tip labels:
## t7, t44, t80, t81, t21, t18, ...
##
## Rooted; includes branch lengths.
x
## t7 t44 t80 t81 t21 t18 t13 t84 t85 t11 t12 t2 t4 t78 t79 t27
## 0 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1
## t36 t70 t71 t42 t43 t14 t15 t73 t74 t69 t5 t1 t52 t53 t97 t98
## 1 1 1 0 1 1 1 0 1 0 0 1 0 1 1 1
## t47 t48 t75 t76 t72 t28 t35 t37 t93 t94 t46 t99 t100 t30 t56 t90
## 1 1 1 1 1 1 0 1 0 0 0 0 0 1 0 0
## t91 t50 t31 t23 t24 t61 t62 t67 t68 t20 t88 t89 t77 t54 t55 t39
## 1 0 1 1 0 0 1 1 1 0 0 0 0 0 1 1
## t63 t64 t57 t58 t45 t3 t34 t86 t87 t59 t60 t51 t25 t26 t82 t83
## 1 1 1 1 0 1 1 1 1 0 0 0 1 1 1 1
## t32 t33 t9 t10 t19 t65 t66 t29 t6 t8 t22 t49 t95 t96 t92 t16
## 0 1 1 1 0 1 1 1 0 0 0 1 1 1 1 1
## t17 t38 t40 t41
## 1 0 1 0
## Levels: 0 1
mcmc<-mcmcMk(tree,x,ngen=2000,model="ER",prior=list(alpha=0.001))
## Running MCMC....
## gen [1,0] logLik accept
## 1 0.2956 -67.298
## 101 0.757 -62.0417 0.59
## 201 0.8819 -61.9181 0.6
## 301 0.6133 -62.529 0.5
## 401 1.5944 -62.9026 0.55
## 501 1.2939 -62.3513 0.56
## 601 1.2689 -62.3083 0.54
## 701 0.766 -62.0253 0.5
## 801 0.8051 -61.9692 0.49
## 901 0.7627 -62.0312 0.5
## 1001 1.4013 -62.544 0.57
## 1101 1.091 -62.0426 0.61
## 1201 1.1005 -62.0543 0.5
## 1301 2.497 -64.3112 0.44
## 1401 1.4033 -62.5477 0.46
## 1501 0.5925 -62.6448 0.53
## 1601 0.9704 -61.9339 0.54
## 1701 0.5553 -62.8899 0.44
## 1801 0.6921 -62.2035 0.47
## 1901 1.4791 -62.6882 0.47
## Done.
Now, let's compare it to our fitted ML model:
summary(mcmc)
## Assuming 20% burn-in as no burn-in was specified....
##
## Mean value of Q from the post burn-in posterior sample:
## 0 1
## 0 -1.037971 1.037971
## 1 1.037971 -1.037971
##
## Median value of Q from the post burn-in posterior sample:
## 0 1
## 0 -0.9082315 0.9082315
## 1 0.9082315 -0.9082315
##
## 95% HPD interval computed either from the post burn-in
## samples or using 'coda':
## lower upper
## [1,0] 0.3608725 1.910995
fitMk(tree,x,model="ER")
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## 0 1
## 0 -0.904984 0.904984
## 1 0.904984 -0.904984
##
## Fitted (or set) value of pi:
## 0 1
## 0.5 0.5
##
## Log-likelihood: -61.915559
##
## Optimization method used was "nlminb"
Or, perhaps more meaningfully, let's compare our posterior distribution of the transition rate, q, to the likelihood surface. Since our prior distribution was chosen to be quite uninformative, these should be similar:
par(mfrow=c(2,1))
plot(density(mcmc))
## Assuming 20% burn-in as no burn-in was specified....
## to compute our likelihood surface:
Mk.surf<-function(tree,x,range.q){
	## design matrix for the ER model: all off-diagonal entries 1
	m<-length(unique(x))
	Q<-matrix(1,m,m,dimnames=list(levels(x),levels(x)))
	diag(Q)<--rowSums(Q)+1
	## likelihood function exported by fitMk
	lik.f<-fitMk(tree,x,model="ER")$lik
	## evaluate the likelihood on a grid of values of q
	q<-seq(range.q[1],range.q[2],length.out=100)
	logL<-sapply(q,function(q,Q) lik.f(q*Q),Q=Q)
	plot(q,exp(logL),type="l",bty="l",
		main="likelihood surface for q",font.main=1)
}
Mk.surf(tree,x,range(density(mcmc)$Density$x))
## Assuming 20% burn-in as no burn-in was specified....
That's pretty cool so far.
Now, for fun, let's try an “ARD” model:
mcmc<-mcmcMk(tree,x,ngen=2000,model="ARD",prior=list(alpha=0.1),
auto.tune=0.3)
## Running MCMC....
## gen [1,0] [0,1] logLik accept
## 1 0.2956 0.2956 -67.298
## 101 0.6481 1.3574 -59.8223 0.4
## 201 0.636 1.5911 -60.5524 0.4
## 301 0.6464 0.7386 -61.3897 0.11
## 401 1.3343 3.2096 -61.7304 0.38
## 501 1.4642 2.3867 -60.42 0.49
## 601 1.6126 2.0809 -61.1757 0.35
## 701 0.607 0.7238 -61.3178 0.18
## 801 0.6791 1.2245 -59.5928 0.24
## 901 1.6073 3.1604 -61.0521 0.25
## 1001 0.8206 1.5465 -59.5629 0.31
## 1101 1.3371 2.1958 -60.1967 0.24
## 1201 1.1337 1.7848 -59.8579 0.3
## 1301 0.8941 1.0917 -60.4888 0.26
## 1401 1.1789 2.1283 -60.0104 0.3
## 1501 0.6855 1.4508 -59.8168 0.26
## 1601 1.0183 1.0874 -61.4304 0.41
## 1701 0.6388 1.8892 -61.7907 0.23
## 1801 0.4876 1.2122 -60.8139 0.22
## 1901 0.589 1.1416 -59.8566 0.29
## Done.
(This probably needs to run longer - but you get the idea.)
summary(mcmc)
## Assuming 20% burn-in as no burn-in was specified....
##
## Mean value of Q from the post burn-in posterior sample:
## 0 1
## 0 -1.4867913 1.4867913
## 1 0.8876883 -0.8876883
##
## Median value of Q from the post burn-in posterior sample:
## 0 1
## 0 -1.325659 1.325659
## 1 0.777271 -0.777271
##
## 95% HPD interval computed either from the post burn-in
## samples or using 'coda':
## lower upper
## [1,0] 0.2959638 1.805645
## [0,1] 0.4251765 3.080158
fitMk(tree,x,model="ARD")
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## 0 1
## 0 -1.394135 1.394135
## 1 0.804726 -0.804726
##
## Fitted (or set) value of pi:
## 0 1
## 0.5 0.5
##
## Log-likelihood: -59.489243
##
## Optimization method used was "nlminb"
plot(density(mcmc))
## Assuming 20% burn-in as no burn-in was specified....
That's kind of cool, right?
The data for this exercise were simulated as follows:
x<-sim.Mk(tree<-pbtree(n=100),Q<-matrix(c(-2,2,1,-1),2,2,
dimnames=list(0:1,0:1),byrow=TRUE))