## Wednesday, January 5, 2011

### Simulating BM part II

In my last post I showed how to do slow Brownian motion simulation on a tree.

The reason that simulations using the approaches given last time are so slow is for two reasons: 1) in the simulations we first had to calculate the matrix vcv(phy), which contains the height above the root for the common ancestor of each pair of species in the tree "phy;" and 2) in the simulations we then had to factorize matrix either using singular value decomposition (in sim.char() or mvrnorm()) or by Cholesky decomposition, which is faster but less stable, (in chol()). Neither computing vcv(phy) nor factorizing vcv(phy) is particularly fast for large trees.

For instance, the calculation:

> tree<-birthdeath.tree(b=1,d=0,taxa.stop=200) # as before
> system.time(vcv(tree))
user system elapsed
4.01 0.00 3.84

Takes nearly 4 seconds!

Fortunately, there's a better way. This better way is provided by the convenient structure that {ape} uses to store trees in memory. In R, phylogenetic trees are stored as a list of class "phylo." This list contains several fundamental elements. The two elements we'll focus on here are "\$edge," a matrix containing the indices of each edge in the tree; and "\$edge.length," a vector containing the length of each edge in tree.

To start, we can use "\$edge.length" to create a vector containing the evolutionary changes along each branch of the phylogeny. We can do this using the {base} function rnorm(), which generates random normal deviates, as follows:

> x<-rnorm(n=length(tree\$edge.length),
sd=sqrt(sig2*tree\$edge.length))

Here, "x" is our vector of changes; and "sig2" is the instantaneous variance of the BM process (our evolutionary rate). Now we have to find a way to add these changes up the tree. This is provided by "\$edge."

First, we create the matrix y, as follows:

> y<-matrix(0,nrow(tree\$edge),ncol(tree\$edge))

Now, we add evolutionary changes up the tree starting from the root:

> n<-length(tree\$tip)
> for(i in 1:length(x)){
if(tree\$edge[i,1]==(n+1))
y[i,1]<-alpha # if at the root
else
y[i,1]<-y[match(tree\$edge[i,1],tree\$edge[,2]),2]
y[i,2]<-y[i,1]+x[i]
}

Here, "alpha" is our desired ancestral state at the root node of the tree.

[Note that doing this relies on a particular order for "\$edge;" that is, one in which every internal edge is listed before its descendants. This is the typical format for "phylo" objects, as created by read.tree() for instance, but I'm not sure whether or not it is the required format. If it isn't, then we should be able to sort \$edge before simulation to have this property.]

Finally, we can finish up with some bookkeeping:

> # create a vector with the root state in the first position
> rm(x); x<-c(y[1,1],y[,2])
> # name the vector by the node indices
> names(x)<-c(n+1,tree\$edge[,2])
> # sort the vector by name
> x<-x[as.character(1:(n+tree\$Nnode))]
> # rename the tip states with the species names in \$tip.label
> names(x)[1:n]<-tree\$tip.label

Now "x" actually contains all the states for all the terminal and internal nodes of the tree. (We generated them, so why not retain them?) To get just the tip states, we can do the following:

> x<-x[1:n] # just the terminal states

I have put all this into a function, fastBM(), which is [or will shortly be] available from my R phylogenetics page. When we load this function from source and time its execution on our 200 taxon tree, as before, we find that it is (indeed) very fast.

> source("fastBM.R")
> system.time(x<-fastBM(tree,alpha=0,sig2=1.0))
user system elapsed
0.14 0.00 0.17