Thursday, October 27, 2011

New function to plot "eras" on the tree

I just created and posted a new function,, that maps "eras" (that is, spans of time defined by a set of limits above the root node of the tree) on a phylogeny. Source code is here. To illustrate what I mean by that, let's load the function & try it:

> require(phytools)
Loading required package: phytools
Loading required package: ape
> require(geiger)
Loading required package: geiger
> setwd("")
> source("")
> tree<-drop.tip(birthdeath.tree(b=1,d=0,taxa.stop=101),"101")
> tree<-rescaleTree(tree,290)

Now let's define our eras on the tree:

> periods<-c(290,245,210,140,65,1.5)
> names(periods)<-c("P","Tr","J","C","Te","Q")
> periods
    P    Tr    J    C    Te    Q
290.0 245.0 210.0 140.0 65.0 1.5

Now let's use and plotSimmap() to map & plot these geological periods on the tree. Note, that I will plot the limits as 290 minus the limits, because the function normally accepts times from the root node (not from the present, as we have expressed these geological time periods).

> mtree<,290-periods)
> cols<-c(1,2,4:7); names(cols)<-names(periods)
> plotSimmap(mtree,cols,pts=0,ftype="off")

This is what we get:

Pretty cool, huh?

To illustrate the use of this analysis, let's imagine we wanted to test the hypothesis (say, using the approach of O'Meara et al. 2006) that the rate of evolution for a phenotypic trait changed as a function of geological period. The first thing we need to do is plot the geological periods on the tree, as we have done using Next we fit the model to our mapped tree and the data using brownie.lite()

In this case, since we don't actually have real data - let's simulate some. We can do that using the phytools function sim.rates(). Let's pretend the evolutionary rate declined with each passing era (halving, in this case):

> sig2<-2^(5:0); names(sig2)<-names(periods)
> sig2
P Tr J C Te Q
32 16 8 4 2 1
> X<-sim.rates(mtree,sig2)

Now we can fit the model:

> X<-sim.rates(mtree,sig2)
> fit<-brownie.lite(mtree,X,maxit=4000)
> fit
[1] 2.975852
[1] -399.7999
[1] 2
P Tr J C Te Q
0.002975852 29.886666888 6.587138450 5.385521826 2.102272018 1.078584364
[1] -393.0759
[1] 7
[1] 0.01952338
[1] "Optimization has converged."

Note, that there is very little power in the deepest period represented on the tree because this is only covered by two branches!

I will post more on how I did this tomorrow.

Saturday, October 22, 2011

Small error in phylosig(), fixed

I realized today after posting the new version of phylosig() online yesterday (described here) that I had made one small error. That is, in the function call the user must assign the variable se, implying "standard errors," but in the function code I treat these as sampling variances (i.e., the square of the SE) rather than as standard errors. This has now been fixed (but please download v0.5 here), so the user can supply a (named) vector of standard errors as intended.

Friday, October 21, 2011

Phylogenetic signal with measurement error

I just posted a new version of the function phylosig() that computes phylogenetic signal using Blomberg's K (Blomberg et al. 2003) and Pagel's λ methods.

The main improvement of this version is that I have added estimation of K with known measurement error, following Ives et al. (2007). This was in response to a user's email request.

This was not too difficult. Basically, without measurement error the values for the tip states on the tree under BM are distributed as a multivariate normal with variance-covariances given by σ2C, where C is an n×n matrix (for n species) containing the height above the root node of the common ancestor of each i,jth species pair on the tree. (For diagonal elements, these are just the heights of each tip node.)

With measurement error, the picture complicates slightly. Now, say the VCV matrix for the measurement error is given by M, then the multivariate distribution of the tip values is given by σ2C+M. The only problem with this is that although we have an analytic solution for σ2 conditioned on no measurement error, we do not for known M. Instead, we have to find σ2 by other means - say, by maximizing the likelihood. This is what I have done in the latest version of phylosig() (v0.4), available here.

To run this, first download the source and load "phytools":

> require(phytools)
> source("phylosig.R")

Now, just for fun, let's simulate data with measurement error.

> tree<-drop.tip(birthdeath.tree(b=1,d=0,taxa.stop=301),"301")
> x<-fastBM(tree) # simulate data
> se<-rchisq(n=300,df=1) # simulate SEs
> names(se)<-names(x)
> e<-rnorm(n=300,sd=se)
> xe<-x+e

We have data generated without error (in x) and with known error (in xe). First, let's compute phylogenetic signal for x:

> phylosig(tree,x)
[1] 1.013599

This is more or less what we expect. Now let's try xe, but with measurement error ignored:

> phylosig(tree,xe)
[1] 0.2640041

Finally, we can try our new function incorporating measurement error:

> phylosig(tree,xe,se=se)
[1] 1.121445

[1] 0.9772118

[1] -543.0133

Very cool! Note that the logL reported here is the log-likelihood for the value of sig2 conditioned on the known measurement error we've provided.

Of course, this is not magic - so the more measurement error we have, the less power we will have to measure phylogenetic signal. That said, incorporating measurement error has the nice property of making our estimates of signal unbiased (whereas they are biased downwards when signal is ignored).

Monday, October 17, 2011

New version of phytools (v0.1-0) on CRAN

I just submitted a new version of "phytools" to CRAN and it is now available online (phytools CRAN page here). This new version (v0.1-0) has a number of new functions over the previous CRAN version (v0.0-8) but little in terms of functional updates from the last version of "phytools" v0.0-9 which I posted on my website but not to CRAN. Most of the updates in the latest version involve improving the documentation (although there is still a lot of work to be done in this area); removing strictly internally used functions from the namespace (this means they can no longer be accessed by users, which is good); and moving a number of little-used dependencies from "Depends" to "Imports." This helps avoid cluttering the namespace (that is, having too many packages loaded with all their named functions stored in memory).

Note that at time of writing, the Windows & Mac OS versions of "phytools" were still migrating through cyberspace to all the mirror CRAN repositories. Thus, to download the latest version of "phytools" in binary form, please check the list of CRAN mirrors to find one that has the binary versions uploaded (e.g., Austria; please note that in general it is a good idea to pick a CRAN mirror geographically near to you so as to minimize network load).

Comments, problems welcome.

Sunday, October 16, 2011

Updating "phytools" package

Today I'm working on the next release of "phytools" v0.1-0. I have a manuscript that describes the "phytools" library that has been reviewed and accepted pending revision at Methods in Ecology & Evolution (link down at time of writing; also, read the journal here). In the article reviews I received some helpful suggestions to bring "phytools" in line with best practices for writing R extensions, and I am presently working on implementing some of these.

Among other suggestions, one thing that thing that was recommended was that I should move internally used dependencies in the DESCRIPTION file from the "Depends:" line to the "Imports:" line. Then, you can add to the NAMESPACE file a line specifying which functions should be loaded from which packages. For instance, the function hessian() in the "numDeriv" package is called by my function evol.vcv(). In the DESCRIPTION file we put:

Imports: numDeriv, ...

and in the NAMESPACE file we put:


Seems to work.

Tuesday, October 11, 2011

Simulating a set number of changes in a discrete character on the tree using sim.history()

An R phylogenetics user recently queried as to whether it was possible to simulate random character evolution with a fixed number of changes - say 2, 5 or 10 - in a binary character on the tree. In fact, this can be done using the function sim.history() in the latest version of "phytools" (not yet available on CRAN, but downloadable from my website here).

I submitted the following code fragment:

nchanges<-5 # desired number of changes
ntrees<-100 # desired number of trees
ntaxa<-100 # desired number of taxa
require(phytools); require(geiger) # required packages
for(i in 1:ntrees){
  # simulate a stochastic B-D tree using birthdeath.tree
  # compute the rate that results in (on average) nchanges changes
  # simulate a stochastic history
  # loop until desired number of changes
  # for fun, plot the history
# tip states for the kth tree are in mtrees[[k]]$states

This is for a desired count of 5 changes on the tree. The way it works is by first computing the substitution rate that will produce an expected number of changes of 5 (in this case); and then by simulating repeatedly with this rate, retaining only those simulations that result in 5 changes on the tree.

The result is stored in a "simmap" style tree; with tip states in the list component $states.

Below is an example of a tree simulated using this method, but with nchanges=10 and plotSimmap(...,pts=0,fsize=0.7):

Estimating a single rate on the tree with brownie.lite()

A user just contacted me to inquire as to how he could obtain the MLE of the single evolutionary rate for a character on a phylogenetic tree. This can easily be done with brownie.lite(), but it involves 1 simple extra step. To describe this, let me illustrate it with an example. First, let's load "phytools" and simulate a random tree:

> require(phytools)
> tree<-rtree(150) # from "ape"

Now, generate random data on the tree using fastBM():

>x<-fastBM(tree,sig2=5) # evolutionary rate is sig^2=5

OK, now the trick. brownie.lite() is designed to fit multiple evolutionary rates to the data & tree, and it reads the rate regimes from the stochastic map style trees created by read.simmap() & make.simmap(). The problem here is that we only have one rate regime. We can solve this easily by appending the matrix, $mapped.edge to our "phylo" object tree. We do this as follows:

> tree$mapped.edge<-matrix(tree$edge.length,nrow(tree$edge), 1,dimnames=list(NULL,"0"))

Now we can run brownie.lite():

> brownie.lite(tree,x,maxit=10000)
[1] 5.441812
[1] -347.2953

Note that if all we want is the MLE of the rate, and don't care at all about the likelihood, we can skip a lot of these steps & just do:

> mean(pic(x,tree)^2)*(length(tree$tip)-1)/length(tree$tip)
[1] 5.441812

which gives the same rate! (Note that we multiply by n-1/n to get the ML rate, but the mean square of the contrasts is actually unbiased. See O'Meara et al. 2006.)

Monday, October 3, 2011

New paper on ABC comparative method

A new paper by Graham Slater and colleagues has just come out "Online Accepted" in Evolution. A direct link to the paper is here. In this article, the authors present a new method in which Approximate Bayesian Computation is used to simultaneously estimate diversification rates & rates of phenotypic evolution in an incompletely sampled tree. Incidentally, I am a (minor) co-author on this article (my contribution was to write some computer code in C that was used in early versions of the software). Check it out.

Symbolic Cholesky factorization (& other things)

Just a quick update. I'm working on developing a version of evol.vcv() that can be constrained in various ways. evol.vcv() implements the method of Revell & Collar (2009), in which we fit two or more different VCV matrices for the Brownian process to different parts of a phylogenetic tree. The type of constraints we might like to impose include, say, forcing rates (diagonal elements) to be equal among different regimes; or rates to differ and covariances to be the same; etc.

Unfortunately, implementing this has been somewhat more complicated than I originally appreciated for a few reasons, not the least among them being that in the function evol.vcv() (code here), I perform ML optimization on the Cholesky factorized rate matrices, rather than on the original matrices. This is hard to avoid because the likelihood function is undefined where the VCV matrices are not positive definite, and this causes problems for the R optimizers optim() if ignored.

I haven't had as much time to address this as I'd hoped, but as a first pass I am working on solving the problem for 3 × 3 matrices and two rate regimes. Hopefully a general solution appropriate for any size matrix or number of regimes will follow.

I'm trying a couple of different tacks. First, I thought I would symbolically factorize a symmetric matrix and then perhaps I could optimize with respect to the parameters of the symbolic functions in the Cholesky factorization. Unfortunately, on closer examination, this is no more straightforward than the original problem! This is because the parameters of the symbolic functions have the same constraints as the original matrices! (Duh.)

My newest attempt is cheaper & simpler. I am now trying to optimize the original matrices, but just returning logL=-Inf for any set of proposed values that fail a test for positive-definiteness. I am also trying some of the different optimization methods implemented in optim() to see if this has an effect. So far this simpler approach is showing some promise. I will report back when I have some further results.