Wednesday, May 8, 2013

Much faster versions of countSimmap, describe.simmap, and getStates

Yesterday, Klaus Schliep pointed out that a trick to speed up lapply's handling of objects of class "multiPhylo" (i.e., just a simple list of objects of class "phylo") was just to first remove the class attribute "multiPhylo". Why this would have any effect at all is somewhat of a mystery to me. is.list(trees) evaluates TRUE regardless of whether or not the class attribute has been removed, so it doesn't seem that lapply would have to coerce our object to a list in either case. Nonetheless - not only does this work, it works tremendously! I have now included this simple trick in countSimmap, describe.simmap, and getStates, all of which use lapply or sapply if the argument tree is an object of class "multiPhylo".

Here's a demo of just how much of an improvement in speed results from this trick:

> require(phytools)
Loading required package: phytools
> packageVersion("phytools")
[1] ‘0.2.56’
>
> # simulate data
> tree<-pbtree(n=100,scale=1)
> Q<-matrix(c(-1,1,1,-1),2,2)
> colnames(Q)<-rownames(Q)<-letters[1:2]
> tree<-sim.history(tree,Q)
> x<-tree$states
>
> # stochastic mapping
> mtrees<-make.simmap(tree,x,nsim=1000)
make.simmap is sampling character histories conditioned on the transition matrix
Q =
          a          b
a -0.8686634  0.8686634
b  0.8686634 -0.8686634
(estimated using likelihood);
and (mean) root node prior probabilities
pi =
  a  b
0.5 0.5
Done.
>
> # ok, now let's time describe.simmap for
> # various subsets of our mapped trees
> system.time(X100<-describe.simmap(mtrees[1:100], message=FALSE))
  user  system elapsed
  2.69    0.02    2.70
> system.time(X200<-describe.simmap(mtrees[1:200], message=FALSE))
  user  system elapsed
  12.71    0.02  12.75
> system.time(X400<-describe.simmap(mtrees[1:400], message=FALSE))
  user  system elapsed
  74.24    0.64  75.57

Woah. I'm not even going to try the full set of 1,000 trees.

OK, now let's compare to describe.simmap with nothing more than the trick suggested by Klaus (i.e., unclass-ing the "multiPhylo" object for every use of lapply or sapply):

> require(phytools)
Loading required package: phytools
> packageVersion("phytools")
[1] ‘0.2.58’
> system.time(Y100<-describe.simmap(mtrees[1:100], message=FALSE))
  user  system elapsed
  0.45    0.00    0.45
> system.time(Y200<-describe.simmap(mtrees[1:200], message=FALSE))
  user  system elapsed
  0.83    0.00    0.82
> system.time(Y400<-describe.simmap(mtrees[1:400], message=FALSE))
  user  system elapsed
  1.78    0.00    1.78
Holy cow! What a huge improvement. We can even now run it on the full set of 1,000 mapped trees:
> par(cex=0.8) # make our tip labels a little smaller
> system.time(YY<-describe.simmap(mtrees,plot=TRUE))
1000 trees with a mapped discrete character with states:
a, b

trees have 29.633 changes between states on average

changes are of the following types:
        a,b    b,a
x->y 10.692 18.941

mean total time spent in each state is:
              a          b    total
raw  12.9314025 18.3673136 31.29872
prop  0.4131608  0.5868392  1.00000

  user  system elapsed
  4.65    0.06    4.71

Wow.

1 comment:

  1. I should have noted that this is in a new minor release of phytools (phytools 0.2-58). - Liam

    ReplyDelete