Monday, December 26, 2011

Pure-birth trees, fast

To do some simulations for the project that I'm working on, I needed to do some Yule (pure-birth) tree simulations, fast. The geiger package has a great function, birthdeath.tree, that I would normally use - but since this function does birth-death tree simulations, for large pure-birth phylogenies it is slower than it needs to be. I have just posted a new phytools function, pbtree, that does pure-birth tree simulations pretty quickly. Presently, it only simulates for a fixed number of tips; but it will also rescale the tree to have an arbitrary total length. Finally, it stops at the nth speciation event (for n taxa), rather than at the (n-1)th event, which is nice because it means that the branches descending from the last speciation event do not have length zero. A direct link to the code for this function is here. I have also included the function in the latest, non-posted, non-static development version of phytools (you can download it here and install from source).

The function was simple enough to write. Basically, the way it works is it first calculates the number of edges in a pure-birth tree of size n. Then, it iterates up from the root of the tree. Each time an edge is to be added, it first draws a random value from an exponential distribution with mean m/b, where m is the number of species currently extant at the time of addition; and b is the birth-rate. It then adds the edge at the end of a randomly chosen edge that does not already end in a bifurcation. I kept track of this by adding new rows to $edge with the convention that the tipward end of that edge was labeled NA until a descendant was attached. In addition, the new random exponential edge length is added not only to the parent of the new edge, but also to all "open" edges on the tree (that is, edges that have not been closed by a bifurcation event).

Just to illustrate the speed advantage of this function for pure-birth trees, try the following:

> require(phytools)
> source("pbtree.R")
> system.time(tr1<-pbtree(n=10000))
   user  system elapsed
  10.62    0.00   10.62
> require(geiger)
> system.time(tr2<-birthdeath.tree(b=1,d=0,taxa.stop=10000))
   user  system elapsed
  41.46    0.03   41.53


The other thing that I like about this function is that you can control the seed for the random simulations using the base function set.seed. This is not true for birthdeath.tree, which seeds from the clock. So, for instance:

> set.seed(1); tr1<-pbtree(n=100)
> set.seed(1); tr2<-pbtree(n=100)
> all.equal(tr1,tr2)
[1] TRUE
> set.seed(1); tr1<-birthdeath.tree(b=1,d=0,taxa.stop=100)
> set.seed(1); tr2<-birthdeath.tree(b=1,d=0,taxa.stop=100)
> all.equal(tr1,tr2)
[1] FALSE


The function birthdeath.tree does have an option to set the seed, but this is less convenient when trying to reproduce whole simulation analyses. (In addition, I wasn't able to get this option to work - but perhaps I was doing something wrong.)

Also in the phytools version linked above is a new version of evolvcv.lite which also returns the numeric value for convergence produced by optim.

2 comments:

  1. Hey Liam - Have you tried Tanja Standler's
    TreeSim package? It has superfast functions for Yule and BD trees that can be conditioned on N, T or both.


    Graham

    ReplyDelete
  2. Nope, although I'm aware of it.

    Writing this function was also an exercise in figuring out the problem - and it simply turned out that (for simple Yule trees) the resulting function turned out to be fairly efficient. It also gave me the selfish opportunity to include the options (stopping at the (n-1)th event from the root, rather than at the nth event, and rescaling to an arbitrary total length).

    Thanks for the comment. I'm looking forward to catching up in Charleston (at SICB) next week!

    ReplyDelete