Monday, April 22, 2013

New, much improved version of read.simmap

An issue with the phytools function read.simmap that was identified a long time ago is that, for some heretofore unclear reason, computation time rises more than linearly with the number of trees in the input file.

Well, this finally got annoying enough for me to look into it. What I ultimately discovered surprised me. Basically, it seems to come down to using a for loop around the code that parses the text string into a "phylo" object, rather than a function in the apply family (in this case, lapply). The reason I programmed it this way originally was historical rather than practical in nature - and I wouldn't do it this way today; however it had not occurred to me that this might be the fatal flaw that underlay the weird performance results for the function. Before identifying this fix, I made a number of different updates that improved the code significantly - however none of these fixed the non-linear performance issue until I re-wrote with lapply standing in where once there had been for.

I decided to do a bunch of other things to overhaul the function, and the new version is here. I've also posted a new phytools build (phytools 0.2-52) with this update.

The performance improvement is something else. Here's a quick demo with a 1000-tree dataset in the file 1000tree.nex. First - the latest CRAN build of phytools:

> require(phytools)
Loading required package: phytools
> packageVersion("phytools")
[1] ‘0.2.50’
> system.time(AA<-read.simmap("1000trees.nex",format="nexus", version=1.0))
  user  system elapsed
 928.19    1.03  941.15

(Yes, that did take 15 minutes!!) Now, let's try the new version (phytools 0.2-52):

> require(phytools)
Loading required package: phytools
> packageVersion("phytools")
[1] ‘0.2.52’
> system.time(BB<-read.simmap("1000trees.nex",format="nexus", version=1.0))
  user  system elapsed
  33.38    0.24  34.52

Ridiculous!

> layout(matrix(1:2,1,2))
> plotSimmap(AA[[1]],pts=F,lwd=3,ftype="off")
no colors provided. using the following legend:
      0      1
"black"  "red"
> plotSimmap(BB[[1]],pts=F,lwd=3,ftype="off")
no colors provided. using the following legend:
      0      1
"black"  "red"
You get the idea.

This is a new, major overhaul - so please report any issues so that I can fix them ASAP.

1 comment:

  1. My neighbor Jarrett Byrnes just pointed out that this is likely because lapply pre-allocates the list. Thus if we did:

    trees<-vector("list",length=Ntree)

    and then ran the for loop instead of lapply it would be the same (or, at least, eliminate the problem of computation time rises exponentially with the number of trees being parsed).

    It looks like he's correct.

    ReplyDelete