Tuesday, May 7, 2013

Faster version of getStates, but why is it faster...?

I was tinkering with the function describe.simmap to try and speed it up, when I discovered that the very simple function getStates, which does nothing more than get the states on a mapped tree in memory for the nodes or tips, had run-time using lapply to iterate over the trees in an object of class "multiPhylo", that seems to rise non-linearly with the number of trees.

> ## simulate tree & data
> tree<-pbtree(n=200,scale=1)
> Q<-matrix(c(-1,1,1,-1),2,2)
> colnames(Q)<-rownames(Q)<-letters[1:2]
> trees<-sim.history(tree,Q,nsim=200)
> # now lets get the states for all trees at all nodes
> # using sapply
> system.time(XX1<-getStates(trees[[1]]))
  user  system elapsed
  0.02    0.00    0.02
> system.time(XX10<-sapply(trees[1:10],getStates))
  user  system elapsed
  0.05    0.00    0.04
> system.time(XX50<-sapply(trees[1:50],getStates))
  user  system elapsed
  0.28    0.00    0.29
> system.time(XX100<-sapply(trees[1:100],getStates))
  user  system elapsed
  1.16    0.00    1.16
> system.time(XX200<-sapply(trees,getStates))
  user  system elapsed
    6.6    0.0    6.6

Hmmm. What's going on here?

What I discovered (somehow - I'm not sure why I tried this) is that if I first split my list of 200 trees into, say, 20 lists of 10 trees; and then I ran sapply(...,getStates) on each of these lists; then recombined the results using cbind, this is much faster. So, for instance:

> g<-function(trees){
 ff<-as.factor(ceiling(1:length(trees)/10))
 aa<-lapply(split(trees,ff),function(x)   sapply(x,getStates))
 y<-if(length(trees)>10) aa[[1]] else aa
 for(i in 2:length(aa)) y<-cbind(y,aa[[i]])
 y
}
> # now run it
> system.time(YY200<-g(trees))
  user  system elapsed
  0.97    0.00    0.97
> # check all equal
> dim(YY200)
[1] 199 200
> dim(XX200)
[1] 199 200
> all(XX200==YY200)
[1] TRUE

Seriously - what's going on here? (I have some ideas - but I'm not sure. Feedback welcome.)

Here's a new version of getStates with this hack implemented internally:

# function to get node states from simmap style trees
# written by Liam J. Revell 2013
getStates<-function(tree,type=c("nodes","tips")){
  type<-type[1]
  if(class(tree)=="multiPhylo"){
    ff<-as.factor(ceiling(1:length(tree)/10))
    aa<-lapply(split(tree,ff),function(x)
     sapply(x,getStates))
    y<-if(length(tree)>10) aa[[1]] else aa
    for(i in 2:length(aa)) y<-cbind(y,aa[[i]])
  } else if(class(tree)=="phylo"){
    if(type=="nodes"){
      y<-setNames(sapply(tree$maps,function(x)
       names(x)[1]),tree$edge[,1])
      y<-y[as.character(length(tree$tip)+1:tree$Nnode)]
    } else if(type=="tips"){
      y<-setNames(sapply(tree$maps,function(x)
       names(x)[length(x)]),tree$edge[,2])
      y<-setNames(y[as.character(1:length(tree$tip))],
       tree$tip)
    }
  } else stop("tree should be an object of class 'phylo' or 'multiPhylo'")
  return(y)
}

4 comments:

  1. Hi Liam,

    that is not really unexpected. R copies a lot during subsetting objects. lapply and therefore also sapply work much better on pure lists.
    Here is a trick sometimes used in ape and phangorn with your original getStates function:


    > trees2 = unclass(trees)
    > system.time(X1 <- sapply(trees, getStates))
    user system elapsed
    10.572 0.000 10.591
    > system.time(X2 <- sapply(trees2, getStates))
    user system elapsed
    0.356 0.000 0.354
    > all.equal(X1, X2)
    [1] TRUE

    You can compare the different memory usage, if you run this code:

    Rprof(tmp <- tempfile(), memory.profiling=TRUE)
    X1 <- sapply(trees, getStates)
    Rprof()
    summaryRprof(tmp, memory="both")
    unlink(tmp)

    Rprof(tmp <- tempfile(), memory.profiling=TRUE)
    X2 <- sapply(trees2, getStates)
    Rprof()
    summaryRprof(tmp, memory="both")
    unlink(tmp)


    Cheers,
    Klaus

    ReplyDelete
    Replies
    1. Hi Klaus. This is terrific.

      I'm blown that removing the class attribute would have this effect! lapply documentation says classed objects will be coerced using as.list....

      Also don't know why there is any performance improvement using split - since split objects still have class "multiPhylo".

      I will definitely use this, Klaus. Thanks!

      - Liam

      Delete
    2. Ok - I understand why split helps, it is because run-time increases more than linearly with the number of trees, and by splitting our object into an unclassed list of lists we cut off a large portion of that more than linear increase.

      However, I don't entirely understand why assigning a class attribute to what is really just a simple list affects lapply & sapply so much.

      Thanks again for the trick. I have now used it in the phytools function describe.simmap, countSimmap, as well as getStates, to great effect.

      - Liam

      Delete
  2. Dear Liam,
    I was glad to lay hands on your tutorials but I was blocked midway.

    I wanted to pull out the tips from data state with the function "getStates" and I consistently have this bug:

    "Error in setNames(sapply(tree$maps, function(x) names(x)[length(x)]), : 'names' attribute [166] must be the same length as the vector [0]"

    Is there a previous command which you applied to construct your "anoletree" in the example?


    Thanks

    ReplyDelete