Monday, January 3, 2011

Brownian motion simulation

In my last post, I simulated Brownian motion (BM) evolution on a simple 5-taxon tree in which I retained the states at internal nodes. Since some people might be interested in retaining ancestral character values during BM simulation, I will write a blog post on how to do this (actually, there are various ways).

Before doing so, I thought it would be fun to compare the different procedures (in terms of computation time) that we can use to conduct BM simulation on a tree in R (focusing only on the tip values for now).

First, we need to have a way to quantify computation time. This is easy in R using the {base} function system.time(). system.time(function), for valid R function "function," calls proc.time() once at the beginning of the execution of function and then a second time at the end, and then calculates the difference between these two values. proc.time() returns the current elapsed real and CPU (in seconds) of our current R session, so the difference between two calls to this function gives us the elapsed time of the system.

To get a sense of this, we can call system.time() on a simple R function, and then a more complicated one. On my super-slow netbook, this would look as follows:

> system.time(four<-2+2)
user system elapsed
0 0 0
> system.time(tree<-birthdeath.tree(b=1,d=0,taxa.stop=200))
user system elapsed
0.40 0.00 0.64

The number under "elapsed" is the relevant value here.

Even though we called the {geiger} function birthdeath.tree() (which grows Yule trees under a user specified speciation/extinction model) from within system.time(), it still assigns our simulated phylogeny to "tree" in this case. Here, we see that adding 2+2 takes essentially no time; whereas simulating a 100 species tree takes a little more than half a second. Fun.

Ok, now I'm going to propose three different ways to simulate BM on our 200 species tree. We focus on getting the tip values for species here - later, I will show how to get ancestral character values as well. None of these methods are particularly fast; we can do much better and this will be the subject of a future post.

I'm going to simulate using three different approaches: 1) I will use the {geiger} function sim.char(); 2) I will use the {MASS} function mvrnorm(); and, finally, 3) I will use the {base} function chol() (for Cholesky decomposition) and the {stats} function rnorm() (for random number generation).

Ok, to simulate with sim.char() is straightforward. sim.char() is a really neat function that can actually simulate under a variety of models - including a discrete character model. BM is the default, so to call (and time) sim.char(), we should just do this:

> system.time(x<-sim.char(phy=tree,model.matrix=matrix(1,1,1)))
user system elapsed
11.06 0.00 12.22

This shows that (on my computer) simulating the evolution of a single character on a 200 taxon stochastic phylogeny takes about 12 seconds.

We can also use the function mvrnorm(). mvrnorm() simulates a data vector from a multivariate normal distribution. For BM on a phylogeny, we should just do the following:

> system.time(x<-mvrnorm(Sigma=1*vcv(tree),
user system elapsed
3.59 0.00 4.29

Even though sim.char() uses mvrnorm(), the latter is much faster at simulating BM evolution. (This is because some of the internal calculations of sim.char() turn out to be quite slow.)

Finally, we can use the Cholesky decomposite matrix and multiply by a random normal vector to achieve the same effect.

> system.time(x<-t(chol(vcv(tree)))%*%
user system elapsed
3.77 0.00 3.92

This is slightly faster than mvrnorm(), but almost indistinguishable. The difference between chol() and mvrnorm() is that the latter uses to singular value decomposition to factorize the vcv(phy) matrix. Cholesky decomposition is faster but less stable if vcv(phy) is ill-conditioned. Ill-conditioned vcv(phy) under BM will be the case if we have very short terminal branches in the tree (which results in the lowest rank eigenvalue of vcv(phy) going to zero). To avoid this in simulation we should do:

> tree<-drop.tip(birthdeath.tree(b=1,d=0,taxa.stop=201),"201")

This is because birthdeath.tree() stops at the most recent speciation event, leaving a divergence depth of 0.0 between the newest tips on the tree if the "taxa.stop" criterion is used.

More to come on this later. . . .

No comments:

Post a Comment