I was tinkering with the function describe.simmap to try and speed it up, when I discovered that the very simple function getStates, which does nothing more than get the states on a mapped tree in memory for the nodes or tips, had run-time using lapply to iterate over the trees in an object of class "multiPhylo", that seems to rise non-linearly with the number of trees.
> tree<-pbtree(n=200,scale=1)
> Q<-matrix(c(-1,1,1,-1),2,2)
> colnames(Q)<-rownames(Q)<-letters[1:2]
> trees<-sim.history(tree,Q,nsim=200)
> # now lets get the states for all trees at all nodes
> # using sapply
> system.time(XX1<-getStates(trees[[1]]))
user system elapsed
0.02 0.00 0.02
> system.time(XX10<-sapply(trees[1:10],getStates))
user system elapsed
0.05 0.00 0.04
> system.time(XX50<-sapply(trees[1:50],getStates))
user system elapsed
0.28 0.00 0.29
> system.time(XX100<-sapply(trees[1:100],getStates))
user system elapsed
1.16 0.00 1.16
> system.time(XX200<-sapply(trees,getStates))
user system elapsed
6.6 0.0 6.6
Hmmm. What's going on here?
What I discovered (somehow - I'm not sure why I tried this) is that if I first split my list of 200 trees into, say, 20 lists of 10 trees; and then I ran sapply(...,getStates) on each of these lists; then recombined the results using cbind, this is much faster. So, for instance:
ff<-as.factor(ceiling(1:length(trees)/10))
aa<-lapply(split(trees,ff),function(x) sapply(x,getStates))
y<-if(length(trees)>10) aa[[1]] else aa
for(i in 2:length(aa)) y<-cbind(y,aa[[i]])
y
}
> # now run it
> system.time(YY200<-g(trees))
user system elapsed
0.97 0.00 0.97
> # check all equal
> dim(YY200)
[1] 199 200
> dim(XX200)
[1] 199 200
> all(XX200==YY200)
[1] TRUE
Seriously - what's going on here? (I have some ideas - but I'm not sure. Feedback welcome.)
Here's a new version of getStates with this hack implemented internally:
# written by Liam J. Revell 2013
getStates<-function(tree,type=c("nodes","tips")){
type<-type[1]
if(class(tree)=="multiPhylo"){
ff<-as.factor(ceiling(1:length(tree)/10))
aa<-lapply(split(tree,ff),function(x)
sapply(x,getStates))
y<-if(length(tree)>10) aa[[1]] else aa
for(i in 2:length(aa)) y<-cbind(y,aa[[i]])
} else if(class(tree)=="phylo"){
if(type=="nodes"){
y<-setNames(sapply(tree$maps,function(x)
names(x)[1]),tree$edge[,1])
y<-y[as.character(length(tree$tip)+1:tree$Nnode)]
} else if(type=="tips"){
y<-setNames(sapply(tree$maps,function(x)
names(x)[length(x)]),tree$edge[,2])
y<-setNames(y[as.character(1:length(tree$tip))],
tree$tip)
}
} else stop("tree should be an object of class 'phylo' or 'multiPhylo'")
return(y)
}
Hi Liam,
ReplyDeletethat is not really unexpected. R copies a lot during subsetting objects. lapply and therefore also sapply work much better on pure lists.
Here is a trick sometimes used in ape and phangorn with your original getStates function:
> trees2 = unclass(trees)
> system.time(X1 <- sapply(trees, getStates))
user system elapsed
10.572 0.000 10.591
> system.time(X2 <- sapply(trees2, getStates))
user system elapsed
0.356 0.000 0.354
> all.equal(X1, X2)
[1] TRUE
You can compare the different memory usage, if you run this code:
Rprof(tmp <- tempfile(), memory.profiling=TRUE)
X1 <- sapply(trees, getStates)
Rprof()
summaryRprof(tmp, memory="both")
unlink(tmp)
Rprof(tmp <- tempfile(), memory.profiling=TRUE)
X2 <- sapply(trees2, getStates)
Rprof()
summaryRprof(tmp, memory="both")
unlink(tmp)
Cheers,
Klaus
Hi Klaus. This is terrific.
DeleteI'm blown that removing the class attribute would have this effect! lapply documentation says classed objects will be coerced using as.list....
Also don't know why there is any performance improvement using split - since split objects still have class "multiPhylo".
I will definitely use this, Klaus. Thanks!
- Liam
Ok - I understand why split helps, it is because run-time increases more than linearly with the number of trees, and by splitting our object into an unclassed list of lists we cut off a large portion of that more than linear increase.
DeleteHowever, I don't entirely understand why assigning a class attribute to what is really just a simple list affects lapply & sapply so much.
Thanks again for the trick. I have now used it in the phytools function describe.simmap, countSimmap, as well as getStates, to great effect.
- Liam
Dear Liam,
ReplyDeleteI was glad to lay hands on your tutorials but I was blocked midway.
I wanted to pull out the tips from data state with the function "getStates" and I consistently have this bug:
"Error in setNames(sapply(tree$maps, function(x) names(x)[length(x)]), : 'names' attribute [166] must be the same length as the vector [0]"
Is there a previous command which you applied to construct your "anoletree" in the example?
Thanks