Tuesday, January 25, 2011

Easy way to set working directory at startup

Note that this applies only to MS Windows users of R.

I today discovered a super easy way to set and change the working directory at startup for Rgui.exe sessions.

First, right click on the shortcut that you use to execute the R gui.

Next, select: "Start in:" from Properties->Short cut.

Finally, enter the complete path to the directory that you want R to start in.

The outcome is illustrated in the picture at right (obtained with Windows 7's built in "snipping" tool).

Tuesday, January 11, 2011

Addendum to fast Brownian simulation

Before I move on to the next full post, I wanted to add a quick addendum to the previous one.

That is, that we can use our new fastBM() function (available here) to simulate under models of evolution other than Brownian motion as well. This by using Luke Harmon et al.'s {geiger} tree transformation function deltaTree(). This function transforms the branch lengths of the tree to be proportional in length to the variances and covariances among species under various models. For instance, the function call:

> tr<-ouTree(phy,alpha)

returns a tree with branch lengths proportional to the variances and covariances among species expected under an Ornstein-Uhlenbeck process with constraint ("selection") parameter, alpha. Similarly:

> tr<-speciationalTree(phy)

returns a tree with all edge lengths set to 1.0.

To simulate under these models using the fastBM() function we came up with last week, we just send fastBM() the phylogenetic tree returned by deltaTree(). For example, to simulate under an Ornstein-Uhlenbeck model with the constraint parameter, alpha, set to alpha=3, we just compute:

> x<-fastBM(tree=ouTree(phy,alpha),alpha,sig2)

substituting our desired values for sigma-squared and the root value for sig2 and alpha (the second) in fastBM(), respectively. [Sorry for the confusing double use of "alpha" here, folks. By convention, alpha is the restraining force in an Ornstein-Uhlenbeck model e.g., Butler & King (2004); unfortunately, alpha is also sometimes given as the ancestral state at the root node of the tree. I'll try to remember to fix this in any updates of fastBM().]

Keep in mind, fastBM() has given us all the terminal and internal states, so should we want only the tip states, we need to then compute:

> x<-x[1:length(phy$tip)]

Similarly, we can simulate under the purely speciational model as follows:

> x<-fastBM(tree=speciationalTree(phy),alpha,sig2)

Here, sig2 is the variance on the speciational "jumps" of evolutionary change during punctuational evolution in our clade.

Simulating character evolution this way is also quite fast. This is because the function ouTree() is fast. (I think this is because it requires just two passes of the tree - one to get the branching times and another to rescale all the edges by the OU model.) For instance, using a 200 species stochastic tree as before, we can see that the simulation:

> system.time(x<-fastBM(tree=ouTree(phy=tree,alpha=2),
   sig2=1,alpha=0))
user system elapsed
0.34 0.00 0.34


takes only about 1/3 of a second on my netbook.

Wednesday, January 5, 2011

Simulating BM part II

In my last post I showed how to do slow Brownian motion simulation on a tree.

The reason that simulations using the approaches given last time are so slow is for two reasons: 1) in the simulations we first had to calculate the matrix vcv(phy), which contains the height above the root for the common ancestor of each pair of species in the tree "phy;" and 2) in the simulations we then had to factorize matrix either using singular value decomposition (in sim.char() or mvrnorm()) or by Cholesky decomposition, which is faster but less stable, (in chol()). Neither computing vcv(phy) nor factorizing vcv(phy) is particularly fast for large trees.

For instance, the calculation:

> tree<-birthdeath.tree(b=1,d=0,taxa.stop=200) # as before
> system.time(vcv(tree))
user system elapsed
4.01 0.00 3.84


Takes nearly 4 seconds!

Fortunately, there's a better way. This better way is provided by the convenient structure that {ape} uses to store trees in memory. In R, phylogenetic trees are stored as a list of class "phylo." This list contains several fundamental elements. The two elements we'll focus on here are "$edge," a matrix containing the indices of each edge in the tree; and "$edge.length," a vector containing the length of each edge in tree.

To start, we can use "$edge.length" to create a vector containing the evolutionary changes along each branch of the phylogeny. We can do this using the {base} function rnorm(), which generates random normal deviates, as follows:

> x<-rnorm(n=length(tree$edge.length),
   sd=sqrt(sig2*tree$edge.length))


Here, "x" is our vector of changes; and "sig2" is the instantaneous variance of the BM process (our evolutionary rate). Now we have to find a way to add these changes up the tree. This is provided by "$edge."

First, we create the matrix y, as follows:

> y<-matrix(0,nrow(tree$edge),ncol(tree$edge))

Now, we add evolutionary changes up the tree starting from the root:

> n<-length(tree$tip)
> for(i in 1:length(x)){
  if(tree$edge[i,1]==(n+1))
    y[i,1]<-alpha # if at the root
  else
    y[i,1]<-y[match(tree$edge[i,1],tree$edge[,2]),2]
  y[i,2]<-y[i,1]+x[i]
}


Here, "alpha" is our desired ancestral state at the root node of the tree.

[Note that doing this relies on a particular order for "$edge;" that is, one in which every internal edge is listed before its descendants. This is the typical format for "phylo" objects, as created by read.tree() for instance, but I'm not sure whether or not it is the required format. If it isn't, then we should be able to sort $edge before simulation to have this property.]

Finally, we can finish up with some bookkeeping:

> # create a vector with the root state in the first position
> rm(x); x<-c(y[1,1],y[,2])
> # name the vector by the node indices
> names(x)<-c(n+1,tree$edge[,2])
> # sort the vector by name
> x<-x[as.character(1:(n+tree$Nnode))]
> # rename the tip states with the species names in $tip.label
> names(x)[1:n]<-tree$tip.label


Now "x" actually contains all the states for all the terminal and internal nodes of the tree. (We generated them, so why not retain them?) To get just the tip states, we can do the following:

> x<-x[1:n] # just the terminal states

I have put all this into a function, fastBM(), which is [or will shortly be] available from my R phylogenetics page. When we load this function from source and time its execution on our 200 taxon tree, as before, we find that it is (indeed) very fast.

> source("fastBM.R")
> system.time(x<-fastBM(tree,alpha=0,sig2=1.0))
user system elapsed
0.14 0.00 0.17

Monday, January 3, 2011

Brownian motion simulation

In my last post, I simulated Brownian motion (BM) evolution on a simple 5-taxon tree in which I retained the states at internal nodes. Since some people might be interested in retaining ancestral character values during BM simulation, I will write a blog post on how to do this (actually, there are various ways).

Before doing so, I thought it would be fun to compare the different procedures (in terms of computation time) that we can use to conduct BM simulation on a tree in R (focusing only on the tip values for now).

First, we need to have a way to quantify computation time. This is easy in R using the {base} function system.time(). system.time(function), for valid R function "function," calls proc.time() once at the beginning of the execution of function and then a second time at the end, and then calculates the difference between these two values. proc.time() returns the current elapsed real and CPU (in seconds) of our current R session, so the difference between two calls to this function gives us the elapsed time of the system.

To get a sense of this, we can call system.time() on a simple R function, and then a more complicated one. On my super-slow netbook, this would look as follows:

> system.time(four<-2+2)
user system elapsed
0 0 0
> system.time(tree<-birthdeath.tree(b=1,d=0,taxa.stop=200))
user system elapsed
0.40 0.00 0.64


The number under "elapsed" is the relevant value here.

Even though we called the {geiger} function birthdeath.tree() (which grows Yule trees under a user specified speciation/extinction model) from within system.time(), it still assigns our simulated phylogeny to "tree" in this case. Here, we see that adding 2+2 takes essentially no time; whereas simulating a 100 species tree takes a little more than half a second. Fun.

Ok, now I'm going to propose three different ways to simulate BM on our 200 species tree. We focus on getting the tip values for species here - later, I will show how to get ancestral character values as well. None of these methods are particularly fast; we can do much better and this will be the subject of a future post.

I'm going to simulate using three different approaches: 1) I will use the {geiger} function sim.char(); 2) I will use the {MASS} function mvrnorm(); and, finally, 3) I will use the {base} function chol() (for Cholesky decomposition) and the {stats} function rnorm() (for random number generation).

Ok, to simulate with sim.char() is straightforward. sim.char() is a really neat function that can actually simulate under a variety of models - including a discrete character model. BM is the default, so to call (and time) sim.char(), we should just do this:

> system.time(x<-sim.char(phy=tree,model.matrix=matrix(1,1,1)))
user system elapsed
11.06 0.00 12.22


This shows that (on my computer) simulating the evolution of a single character on a 200 taxon stochastic phylogeny takes about 12 seconds.

We can also use the function mvrnorm(). mvrnorm() simulates a data vector from a multivariate normal distribution. For BM on a phylogeny, we should just do the following:

> system.time(x<-mvrnorm(Sigma=1*vcv(tree),
mu=rep(0,length(tree$tip))))
user system elapsed
3.59 0.00 4.29


Even though sim.char() uses mvrnorm(), the latter is much faster at simulating BM evolution. (This is because some of the internal calculations of sim.char() turn out to be quite slow.)

Finally, we can use the Cholesky decomposite matrix and multiply by a random normal vector to achieve the same effect.

> system.time(x<-t(chol(vcv(tree)))%*%
rnorm(length(tree$tip),sd=1))
user system elapsed
3.77 0.00 3.92


This is slightly faster than mvrnorm(), but almost indistinguishable. The difference between chol() and mvrnorm() is that the latter uses to singular value decomposition to factorize the vcv(phy) matrix. Cholesky decomposition is faster but less stable if vcv(phy) is ill-conditioned. Ill-conditioned vcv(phy) under BM will be the case if we have very short terminal branches in the tree (which results in the lowest rank eigenvalue of vcv(phy) going to zero). To avoid this in simulation we should do:

> tree<-drop.tip(birthdeath.tree(b=1,d=0,taxa.stop=201),"201")

This is because birthdeath.tree() stops at the most recent speciation event, leaving a divergence depth of 0.0 between the newest tips on the tree if the "taxa.stop" criterion is used.

More to come on this later. . . .