Sunday, April 14, 2013

Simulating trees conditioned on taxa & time with 'direct' sampling

The day before yesterday I posted a new version of pbtree that could simulate first pure-birth & then birth-death phylogenies conditioned on both the total number of tips & the total time via rejection sampling. That is, it simulates trees for total time t and then rejects any phylogeny that doesn't have the correct number of tips, N.

Well, as I was out walking & eating a cookie today, it suddenly occurred to me how we might sample birth-death trees conditioned on both N & t 'directly' (rather than via rejection). It would be by the following two step procedure:

1. Simulate a set of wait-times to total time t. For each wait time, we also need to compute a birth or death event and keep count of the births minus the deaths (m) so that we can draw our exponential waiting times with λ = m × (b + d). Repeat until the number of lineages, m, at t is equal to N.

2. With this set of waiting-times & birth or death events in hand, simulate the tree.

By simulating the waiting-times first, followed by the tree, we know before we start that our tree will be of length t with N taxa. This should reduce computation time dramatically.

I've posted an experimental version of this here. The following is a demo:

> source("pbtree.R")
> tt<-log(100)-log(2)
> system.time(tree1<-pbtree(b=1.2,d=0.2,n=100,t=tt))
simulating with both taxa-stop (n) and time-stop (t) is
performed via rejection sampling & may be slow

  114 trees rejected before finding a tree

   user  system elapsed
   3.88    0.00    3.88
> length(getExtant(tree1))
[1] 100
> max(nodeHeights(tree1))
[1] 3.912023
>
> system.time(tree2<-pbtree(b=1.2,d=0.2,n=100,t=tt, method="direct"))
simulating with both taxa-stop (n) & time-stop (t) using
'direct' sampling. this is experimental
   user  system elapsed
   0.13    0.00    0.12
> length(getExtant(tree2))
[1] 100
> max(nodeHeights(tree2))
[1] 3.912023

We can also do somethings that are just impossible with rejection sampling - for instance, simulating with a very high extinction rate, e.g.:

> system.time(tree3<-pbtree(b=4,d=3,n=100,t=tt, method="direct"))
simulating with both taxa-stop (n) & time-stop (t) using
'direct' sampling. this is experimental
   user  system elapsed
  10.30    0.00   10.31
> tree3

Phylogenetic tree with 365 tips and 364 internal nodes.

Tip labels:
        t1, t2, t3, t12, t53, t56, ...

Rooted; includes branch lengths.
> length(getExtant(tree3))
[1] 100
> max(nodeHeights(tree3))
[1] 3.912023
> plotTree(tree,ftype="off",lwd=1)

This isn't without its limits, of course - but we could never do this via rejection sampling. Neat.

8 comments:

  1. "λ = m × (b - d)" was a typo and has been edited to "λ = m × (b + d)". Liam

    ReplyDelete
  2. Interesting. I once did a similar 'cheat' with simulating birth-death trees that had gone stochastically extinct by using rtree and replacing the branch lengths with exponential random variables with rate = b+d.

    ReplyDelete
    Replies
    1. Dave -

      Interesting. I'm not sure this gives the right distribution of waiting times. For instance, consider d->0? As d gets small, towards zero, you expect very few extinctions - but your rtree phylogeny would have many/all going extinct before the end of the simulation, correct?

      To be clear, this is not what I did here. In this function (with method="direct"), I merely simulated the waiting times & births/deaths first through time t; then evaluated whether or not they satisfied the criterion criterion N. Then I built the structure of the tree using these wait-times & births/deaths in memory.

      - Liam

      Delete
    2. Yes, for the sake of these simulations, it is presumed that a clade that is entirely extinct is desired. This becomes very unlikely as d goes to zero, but that should be obvious to a user. In my case, I tend to simulate b=d, as suggested by estimates from the marine invertebrate record (Stanley, 1977; Sepkoski, 1998).

      Yeah, I understood what you did was pretty different.

      Delete
  3. Liam,

    Simulating by conditioning on both the number of tips and the age of the tree can be done with Tanja Stadler's math (literally directly) and as implemented in the TreeSim function sim.bd.taxa.age(). The math to do this is described in Stadler 2011 (Sys Bio) http://sysbio.oxfordjournals.org/content/60/5/676.short. I will also note that Graham Slater's ABC MECCA approach uses this for the simulations and I used this in my paper looking at how "sampling bias" when looking at large phylogenies http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0043348

    cheers,
    matt

    ReplyDelete
    Replies
    1. Hi Matt.

      Yes, I'm aware of that. I was going to do a post about it comparing it to phytools - but I've been experiencing some problems with TreeSim for sim.bd.taxa.age(...,mrca=TRUE). Either it crashes R or, when I tried TreeSim 1.8 this morning, it works - but then R crashes when phytools is loaded. Not sure what's happening. In theory TreeSim should be much more powerful than phytools.

      A couple of things offered by pbtree that are not in TreeSim functions, so far as I can tell:

      1) pbtree can do discrete-time simulations (here), although so far only 'rejection' sampling is implemented for fixed N & t;

      2) pbtree can (optionally) retain extinct & extant lineages, like birthdeath.tree (e.g., here). I cannot find this in TreeSim, but I could be missing it.

      Thanks Matt!

      - Liam

      Delete
    2. Hi Liam,

      I've been struggling with this crashing problem in TreeSim as well. As far as I can tell, it's caused by a bug in collapse.singles in ape 3.0-8 which calls reorder on the tree with degree two nodes, leading to a seg fault and an R crash. Removing the reorder line from collapse.singles (or reverting to an older version of ape) stops the crashing. Just a heads up if you haven't already solved this problem.

      It's also worth noting that TreeSim can return (and does so by default) complete trees when you condition either on t or n, but not both. The algorithm to simulate trees conditional on t and n is efficient, but it's only simulating node heights for the reconstructed process, so cannot return extinct lineages.

      -Mike

      Delete
    3. Hi Mike.

      Thanks for the info.

      - Liam

      Delete