## Friday, April 12, 2013

### Discrete- & continuous-time phylogeny simulation

Earlier today I posted a new version of the phytools pure-birth tree simulator: pbtree. This version is neat because it allows you to condition on the total number of tips in the tree, the total time, or (via rejection sampling) both.

However, when I started looking at pbtree it was actually because I wanted to add in discrete-time simulations, in addition to the continuous-time tree simulations already available in pbtree. (There is at least some interest in this.) Unlike discrete-time continuous character simulations, such as in the visualization tool bmPlot, simulating trees under a discrete-time model is not the same as simulating every time-step from 0 to the total time of the simulation. Rather, it just means we draw our waiting times between speciation events from a different probability distribution.

For continuous-time simulations we draw waiting times from an exponential distribution in which the rate parameter, λ, is equal to the speciation rate × the number of "open" edges at the last event. For discrete-time simulations, we draw waiting times from the discrete analog - the geometric distribution - which describes the probability that the first success in a series of Bernoulli trials occurs in trial x. Since the outcome of a discrete-time speciation process can be viewed as a Bernoulli trial (i.e., a lineage speciates, or it does not, in every time-step), this is the correct distribution of waiting times under our model.

There is a little wrinkle, here, though. Do you recall that to get the next wait time given m open edges we simulated rexp(n=1,lambda=b*m)? Well, in discrete-time the analogy is imperfect. What we need to do instead is simulate rgeom(n=m,prob=b) and then the wait-time to the next even is the minimum of that. Why? This is because we can think about the m open lineages as m different series of Bernoulli trials, and we want to find the wait time to the next event on any lineage in the tree.

In addition, we have to allow for the possibility that there are multiple speciation events within a single time-step - as long as all take place on different branches. To do this, we need to get the set of minimal rgeom(n=m,prob=b), rather than just a single minimum.

Now that we have a waiting time or set of (equal) waiting times, we just add a new node to any of the "open" edges in the tree, all with equal probability. We do that until we reach n or t. For discrete-time we sometimes have a problem when our stop criterion is the number of tips, not time. This is that sometimes multiple speciation events will occur within a timestep. The current version of the function allows this and spits a warning if the number of tips in the simulated tree are different from the stopping criterion.

The code for this updated version of pbtree is here; however since it uses an internal phytools function, to try it out you should install the latest version of phytools from source (phytools 0.2-43). Here's a quick demo:

> require(phytools)
> packageVersion("phytools")
 ‘0.2.43’
>
> ## this is our speciation prob to get 100 tips after
> ## 1000 time-steps, on average in discrete time
> b<-exp((log(100)-log(2))/1000)-1
>
> # simulate
> trees<-pbtree(b=b,t=1000,type="discrete",nsim=2000)
> N<-sapply(trees,function(x) length(x\$tip.label))
> mean(N)
 98.7815
> var(N)
 4848.484
> range(N)
 3 595
> hist(N,0:24*25,col="grey",xlab="frequency",ylab="number of tips",main=NULL)

As before, we can also condition on both N and t; however, since this is done via rejection sampling, it is probably wise (at least for illustrative purposes here) to choose a reasonable combination of b, n, and t. For instance:

> # use our speciation probability from before
> tree<-pbtree(b=b,n=100,t=1000,type="discrete")
simulating with both taxa-stop (n) and time-stop (t) is
performed via rejection sampling & may be slow

41 trees rejected before finding a tree

> plotTree(tree,ftype="off")

Cool. That's it for now.