Monday, April 29, 2013

2x faster version of make.simmap for Q="mcmc"

In a previous post I promised that I could speed up make.simmap(..., Q="mcmc"), i.e., the full Bayesian MCMC make.simmap method. (Matt Pennell said he didn't care, which is very generous of him.)

There was one really low-hanging fruit which is that during Bayesian MCMC sampling of the transition matrix, Q, make.simmap was unnecessarily computing the likelihood twice for each iteration of the MCMC. This is because make.simmap uses modified code from ape's ace function internally to perform Felsenstein's pruning algorithm in computing the likelihood of Q as well as well as the conditional scaled likelihoods at each node. Normally, ace returns the likelihood & the scaled conditional likelihoods of the subtrees at each internal node by first maximizing the likelihood using the optimizer nlminb, and the computing the conditional likelihoods given MLE(Q). I modified the code to take a fixed value of Q, however it still returned both the overall log-likelihood & the conditional likelihoods via two rounds of the pruning algorithm. For the vast majority of generations in our MCMC we don't care about the conditional scaled likelihoods (because we're not sampling that generation) so we should only compute the overall log-likelihood which we need to compute the posterior odds ratio and make a decision about whether to retain the updated parameter values or not. Whenever we want to sample a particular value for Q (i.e., every samplefreq generations), we can compute the conditional likelihoods for all the nodes - since we'll need these later.

This improvement has a pretty dramatic effect on computation time - pretty much halving it as you might expect. Here's a demo using my not-too-fast i5 desktop:

> # simulate tree & data
> Q<-matrix(c(-1,1,1,-1),2,2)
> colnames(Q)<-rownames(Q)<-LETTERS[1:2]
> tree<-sim.history(pbtree(n=100,scale=1),Q)
> x<-tree$states
> # number of changes, etc., on the true tree
> describe.simmap(tree)
1 tree with a mapped discrete character with states:
 A, B

tree has 21 changes between states

changes are of the following types:
  A  B
A  0 11
B 10  0

mean total time spent in each state is:
              A          B    total
raw  11.5075650 10.5967548 22.10432
prop  0.5206025  0.4793975  1.00000

> # OK, now let's run make.simmap & time it
> system.time(mtrees.old<-make.simmap(tree,x,Q="mcmc", nsim=200,prior=list(beta=2,use.empirical=TRUE)))
Running MCMC burn-in. Please wait....
Running 20000 generations of MCMC, sampling every 100 generations. Please wait....
make.simmap is simulating with a sample of Q from the posterior distribution
Mean Q from the posterior is
Q =
          A        B
A -1.205146  1.205146
B  1.205146 -1.205146
and (mean) root node prior probabilities
pi =
[1] 0.5 0.5
Done.
  user  system elapsed
398.44    0.23  401.99
> # pretty slow!!
> describe.simmap(mtrees.old)
200 trees with a mapped discrete character with states:
 A, B

trees have 29.93 changes between states on average

changes are of the following types:
        A,B    B,A
x->y 15.055 14.875

mean total time spent in each state is:
              A          B    total
raw  11.3629822 10.7413376 22.10432
prop  0.5140616  0.4859384  1.00000

OK - now let's compare to the updated version:

> source("make.simmap.R")
> system.time(mtrees.new<-make.simmap(tree,x,Q="mcmc", nsim=200,prior=list(beta=2,use.empirical=TRUE)))
Running MCMC burn-in. Please wait....
Running 20000 generations of MCMC, sampling every 100 generations. Please wait....
make.simmap is simulating with a sample of Q from the posterior distribution
Mean Q from the posterior is
Q =
          A        B
A -1.227405  1.227405
B  1.227405 -1.227405
and (mean) root node prior probabilities
pi =
[1] 0.5 0.5
Done.
  user  system elapsed
214.87    0.03  216.36
> # still slow, but better
> describe.simmap(mtrees.new)
200 trees with a mapped discrete character with states:
 A, B

trees have 29.975 changes between states on average

changes are of the following types:
        A,B  B,A
x->y 14.585 15.39

mean total time spent in each state is:
              A          B    total
raw  11.1900994 10.9142203 22.10432
prop  0.5062404  0.4937596  1.00000

The code for this version is here, but I have also posted a new phytools build (phytools 0.2-54). Check it out & please report any issues or bugs.

No comments:

Post a Comment