Monday, February 28, 2011

MCMC method: analyzing the posterior sample

As I mentioned in my previous post, the posterior sample can be treated as would any posterior sample from a Bayesian analysis. For instance, we should expect to compute parameter estimates, effective sample sizes (which takes into account the degree of correlation among our samples), and 95% credible intervals mostly in the typical way. We can do many of these things within R using the MCMC diagnostics package {coda} (Plummer et al. 2010); however, for those users that prefer to use the Java program Tracer (by Rambaut & Drummond), this is fine to. We just need to first write our MCMC samples to a text file:

> write.table(x=res$mcmc,file="mcmc.sample.txt",sep="\t", row.names=FALSE,quote=FALSE)

The image below is a screenshot of the program Tracer, showing the trace of the log-likelihood of our analysis from last time:

There is no explicit manual for Tracer, but it is very easy to use. In additional posts, I will discuss how to do these analyses in {coda} as well as how to summarize the posterior sample for the position of the rate shift (something that cannot be done easily using either Tracer or {coda}).

Saturday, February 26, 2011

Ecomorph phylomorphospace

Check out the neat plot created by Luke Mahler using my phylomorphospace() function (available here). Click on the image at right to enlarge. In this plot are the phenotypes and phylogenetic relationships among 100 Greater Antillean Anolis lizards. According to Luke, PC1 is limb lengths (more or less); and PC2 is overall size. The tips of the tree are color coded by ecomorph. The phylomorphospace graphing method is based on Sidlauskas (2008).

Luke made a few modifications/additions to my code to obtain this plot. First, he filled and shrunk the points that are the inferred states at internal nodes. He did this by changing the following line of code:


to this:


To plot the large, color coded tip nodes, he merely created the vector color.code.e, containing the species color coded by ecomorph; and then plotted his raw data for PCs 1 & 2 on top of his phylomorphospace graph as follows:

> points(X[,1],X[,2],col=(color.code.e),pch=16,cex=1.7)

Thanks to Luke for sharing this.

Friday, February 25, 2011

MCMC method for rate variation

So, in addition to tracking new development activity with this page, I thought I would also post a few words about some of the existing functions that I am distributing from my R phylogenetics page.

Today, I will discuss the function evol.rate.mcmc(). This function takes a continuous character vector and a phylogeny with branch lengths, and then uses Bayesian MCMC to identify the phylogenetic position of a shift in the evolutionary rate of phenotypic evolution through time. Evolutionary rates and shift-points between rates are sampled from their joint posterior distribution. The underlying method for this function is described in a manuscript presently in review:

Revell, L. J., D. L. Mahler, P. R. Peres-Neto, and B. D. Redelings. Submitted. A new method for identifying exceptional evolutionary diversification.

To give a quick example of how this works, take the data and tree presented at right. These data were generated with a high rate (σ2=20) on the red branches of the tree; and a low rate (σ2=1.0) on the blue branches. The data are in a labeled vector, x, and the phylogenetic tree is given in an {ape} "phylo" object tree.

To load the source and required dependencies, we will first enter:

> source("evol.rate.mcmc.R")
> require(ape); require(coda);

The package {coda} will be used to analyze and compute various diagnostics for our MCMC chain (next post) - in this case our posterior sample - but we may as well load it now.

To run the function, we simply give R (some variant of) the following command:

> res<-evol.rate.mcmc(tree,x,ngen=20000,control=list(sd1=1.0, sd2=1.0,sda=0.5,sdlnr=2.0,rand.shift=0.05,print=1000))

This tells our MCMC to run under the following conditions:

ngen: the number of generations for our MCMC run is 20,000;
sd1: standard deviation of the proposal distribution for rate 1 is 1.0;
sd2: standard deviation of the proposal distribution for rate 2 is 1.0;
sda: standard deviation on the proposal distribution for the ancestral state at the root node of the tree is 0.5.
rand.shift: probability of a random shift to a different position in the tree is 0.05 [otherwise the MCMC will proceed as a random walk with a default variance];
print: print frequency for the chain (to screen - all generations are stored in res$mcmc).

We also need to assign a prior probability distribution to our parameters. I chose to specify a uniform prior on the absolute values of rates 1 and 2, the root state, and the position of the rate shift; however, I have used a log-normal prior distribution for the ratio of rate 1 / rate 2, centered on 0.0, and with a standard deviation given by sdlnr - in this case we set this to 2.0.

The run returns a list with two components:

$mcmc: is our matrix of parameters from our MCMC run, including the sample of shift points between rates on the tree; and
$tips: is a list of vectors containing all the tips in the clade defined as rate 1.

In my next post, I will discuss analysis of the posterior sample from this run.

Thursday, February 24, 2011

Luke Harmon phyloseminar

Luke Harmon (pictured right), who has been very active and productive in phylogenetic method development over the past few years, will be giving the next edition in the web seminar series on phylogenetics at Luke's seminar will take place tomorrow at 1PM Eastern and can be viewed live (instructions here). If you can't attend the live version of the seminar, it will also be archived here (along with a list of other great prior seminars in this series). Luke will be presenting a talk on a variety of new cutting-edge methods that are being developed in his laboratory at the University of Idaho.

Tuesday, February 22, 2011

New "beta" read.simmap() function available

I have added the capacity to read "nexus" style SIMMAPv1.0 tree files to my new read.simmap() function. To remind the reader, this function (v0.2 on my beta version page) is an improvement over earlier versions in that it allows the user to retain the order and the times spent in each mapped state along each edge of the tree. My earlier read.simmap() function (v0.1) destroyed the ordering of the states on each edge, storing only the total branch length in each state for each edge. The present version can read both "nexus" and "phylip" style tree files; as well as multiple trees (in which case, the function will create a modified "multiPhylo" object in memory).

For example, to load and use the present version:

> source("read.simmap.R")
> tree<-"((A:{aqua,0.3:terr,0.4:aqua,0.3},B:{aqua,1.0}):{aqua,1.0},(C:{aqua,0.25:terr,0.75},D:{aqua,1.0}):{aqua,1.0});"
> mapped.tree<-read.simmap(text=tree,format="phylip")

The modified "phylo" object that is created by read.simmap() in this case has the following structure:

> mapped.tree
[,1] [,2]
[1,] 5 6
[2,] 6 1
[3,] 6 2
[4,] 5 7
[5,] 7 3
[6,] 7 4
[1] 3
[1] "A" "B" "C" "D"
[1] 1 1 1 1 1 1
aqua terr aqua
0.3 0.4 0.3
aqua terr
0.25 0.75
edge aqua terr
5,6 1.00 0.00
6,1 0.60 0.40
6,2 1.00 0.00
5,7 1.00 0.00
7,3 0.25 0.75
7,4 1.00 0.00
[1] "phylo"

Monday, February 21, 2011

Ancestral character estimation, with a trend

In response to the recent R-sig-phylo email listserve discussion thread on simulating Brownian evolution with a trend, Pasquale Raia inquired to the list about whether it would be straightforward to estimate ancestral states under this model. In practice, of course, it is generally not possible to fit the BM with a trend model (BMT for brevity in this post) - nor to estimate ancestral states under BMT - for conditions in which all the tips of the tree (and our data for trait values) are contemporaneous. However, for conditions in which the tips of the tree are effectively not contemporaneous (for instance, if we have fossil species in the tree or if the evolutionary process is punctuational), then we are in a theoretically much better position to estimate evolutionary trends.

The way I decided to approach this problem was very simple. First, let's consider the expected distribution of the data at tips and internal nodes. Well, we know from theory (and Gene Hunt's listserve post) that the theoretical distribution of tip and (by simple extension) internal node values under BMT is:

> x<-mvrnorm(n=1,mu=a+diag(C)*mu,Sigma=sig2*C)

Here, a is the root value, mu is the trend per unit time, and sig2 is the variance of the stochastic component of the model. C is a matrix of the heights above the root for each terminal and internal node in the tree (excluding the root). For example, in the 4 taxon phylogram plotted above, the matrix C is given by:

> C
     t1     t3     t2     t4     6     7
t1 0.984819 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
t3 0.000000 1.1680555 0.4643410 0.3490977 0.3490977 0.4643410
t2 0.000000 0.4643410 0.4908593 0.3490977 0.3490977 0.4643410
t4 0.000000 0.3490977 0.3490977 0.5612080 0.3490977 0.3490977
6 0.000000 0.3490977 0.3490977 0.3490977 0.3490977 0.3490977
7 0.000000 0.4643410 0.4643410 0.3490977 0.3490977 0.4643410

Labels "t1", "t2", etc., correspond to tips of the tree; whereas "6", "7", etc., are internal node numbers from phy$edge.

Now, for a given set of internal and terminal values, C, and values for a, mu, and sig2, we can calculate the probability density using dmnorm() in the {mnormt} package. Of course, that means we can also calculate the likelihood the ancestral states and evolutionary parameters, conditioned on the values for terminal species and the tree.

I have just put up (on my R-phylogenetics page, here) a very preliminary version of a function that uses the built-in R multivariate optimizer, optim(), to do this.

To test it out, I did the following.

First, I loaded fastBM() and anc.trend() from source:

> source("fastBM.R"); source("anc.trend.R")

Now, I simulated a random tree and data using rtree() from the {ape} package and fastBM():

> tree<-rtree(20)
> x.all<-fastBM(tree,a=0,mu=2,sig2=1,internal=TRUE)

Next, I estimated ancestral traits by maximizing the likelihood using anc.trend():

> result<-anc.trend(phy=tree,x=x.all[1:length(tree$tip)], maxit=2000)

This function runs quite slow (not surprising, since in the above example I am doing a 22 dimension optimzation), so it has been somewhat hard to play around with so far. I can report that when optim() converges, mu is equivalent to the trend parameter estimated using fitContinuous(model="trend") in {geiger}.

In the above example, I obtained:

> result
     21     22     23     24     25     26
-0.2420098 0.1641928 1.2565491 3.0897363 1.8626764 1.0436854
     27     28     29     30     31     32
2.0145277 3.7843751 4.8605234 6.3360333 1.7286019 2.3675538
     33     34     35     36     37     38
3.1769088 5.3097681 3.3040749 2.1776491 0.2706769 0.9679458
[1] 1.693923
[1] 0.3092271
[1] -13.80569
[1] 0

Although mu estimated in my function, and mu from fitContinuous() are the same; sig2 differed (invariably) by exactly n/(2n-2). This is the same as the ratio of the number of tips/(number of tips + number of nodes) in an unrooted, bifurcating tree.

In addition, the estimated ancestral states are fairly closely correlated with their known conditions. For instance, when I ran the code given above:

> cor(c(0,x.all[21:38):length(x.all)]),result$ace)
[1] 0.5973801

I should probably emphasize that the above results likely oversell the method because in general this cannot be used for species with only contemporaneous tips (rtree() tends to produce phylogenies with wildly non-contemporaneous tips); and furthermore the simulated trend in this example was quite strong. Nonetheless - this is cool!

Wednesday, February 16, 2011

BM simulation with a trend

Dave Bapst just posted a query to the R-sig-phylo about simulating Brownian evolution with a trend. I take this to mean that evolutionary changes along branches in the tree are drawn from a random normal distribution with variances sig2*tree$edge.length and means mu*tree$edge.length. [Initially, in my first response to the query, I did not multiply mu by tree$edge.length; but that doesn't make any sense.] I have now added this capability to v0.2 of my function fastBM() which can be downloaded from my R phylogenetics page here.

After loading the function from source, we can just type:

> x<-fastBM(tree,a=0,mu=0.5,sig2=2,internal=FALSE)

at the command prompt. This will return a n x 1 vector of character values (in x for the n species in tree.

To test out this function, let's try simulating data and then fitting a "trend" model using {geiger}'s fitContinuous() function. [Obviously, since these are stochastic simulations, individual results will vary.]

> tree<-rtree(500) # first simulate stochastic tree
> x<-fastBM(tree,mu=0.5,sig2=2) # now data on the tree
> fitContinuous(phy=tree,data=x,model="trend") # now fit model
Fitting trend model:
[1] -900.9902
[1] 1.087359
[1] 1.87744
[1] 0.4968745
[1] 1807.980
[1] 1808.029
[1] 3

Seems to work!

Note that Gene Hunt also pointed out (correctly, of course) that this can be done using the {MASS} function mvrnorm(). To see the discussion thread on this topic, check out the archived R-sig-phylo emails here.

Tuesday, February 15, 2011

New read.simmap() function

OK, I have an "alpha" version of a new read.simmap() function that can read in a stochastic character mapped (SIMMAP v1.0) style tree from file in "phylip" format and retains the ordering of the character states along the edges of the tree. Although it is incomplete (by definition), I have posted this function on my R phylogenetics development page here (it is v0.2.alpha of read.simmap()).

The way I decided to store the information about the order and time spent in each state on every edge was in a list of named vectors called $maps. For backward compatibility with my other functions using SIMMAP format trees (e.g., brownie.lite()) I also compute the matrix $mapped.edge containing the summed times spent in each state along each edge of the tree.

So, for the following phylogeny:


the modified "phylo" object returned by my function has the following structure:

> temp<-"((A:{aqua,0.3:terr,0.4:aqua,0.3},B:{aqua,1.0}):{aqua,1.0},(C:{aqua,0.25:terr,0.75},D:{aqua,1.0}):{aqua,1.0});"
> tree<-read.simmap.alpha(text=temp)
> tree$Nnode
[1] 3
> tree$tip.label
[1] "A" "B" "C" "D"
> tree$edge
[,1] [,2]
[1,] 5 6
[2,] 6 1
[3,] 6 2
[4,] 5 7
[5,] 7 3
[6,] 7 4
> tree$edge.length
[1] 1 1 1 1 1 1
> tree$maps
aqua terr aqua
0.3 0.4 0.3
aqua terr
0.25 0.75
> tree$mapped.edge
edge aqua terr
5,6 1.00 0.00
6,1 0.60 0.40
6,2 1.00 0.00
5,7 1.00 0.00
7,3 0.25 0.75
7,4 1.00 0.00

Next, I will add the capacity to read "nexus" format SIMMAP v1.0 style trees.

Adding branch lengths

In a few different prior posts (1,2,3) I described the development of a simple R function, read.newick(), that is available from my development page here. This function just reads a single Newick style tree from a character string or from file, and then creates an {ape} "phylo" object in memory. The purpose of developing this function is to add the capability of reading stochastic character map information from within the Newick tree (a problem that I tackled a very different way in a prior post). I've also provided the code online so that readers can see the R implementation of our tree reading algorithm presented in a prior blog post.

Today I just added the reading of branch lengths to my previous function. This seems to work fine, and it is easy to see how I did this by reviewing my code - to which I have also added some comments.

Branch lengths in a Newick style tree are given after a full colon following either a label or a ")" right bracket. Thus, the branch lengths for our simplified primate tree might be: (((Human:7,Chimp:7):3,Gorilla:10):15,Monkey:25);, with branch lengths in millions of years (more or less). When our parser encounters the character ":", it just needs to assign the real number following the ":" to the edge preceding the node we are on. Otherwise, we just follow our algorithm as previously described.

The only amendment to the structure of the tree in memory is a vector, $edge.length, containing the length of each branch where the order corresponds to the order of the edges in $edge. Thus, for the tree (((Human:7,Chimp:7):3,Gorilla:10):15,Monkey:25);, we would get the following:

> temp<-"(((Human:7,Chimp:7):3,Gorilla:10):15,Monkey:25);"
> phy<-read.newick(text=temp) # or use read.tree()
> phy$Nnode
[1] 3
> phy$tip.label
[1] "Human" "Chimp" "Gorilla" "Monkey"
> phy$edge
[,1] [,2]
[1,] 5 6
[2,] 6 7
[3,] 7 1
[4,] 7 2
[5,] 6 3
[6,] 5 4
> phy$edge.length
[1] 15 3 7 7 10 25

Monday, February 14, 2011

Phylomorphospace updated

Luke Mahler identified an error in the node coloring option of my phylomorphospace() function. Basically, the function was assigning the colors of the daughter node to each parent node every time that a new edge was plotted. This has now been fixed and the updated version is available from my R-phylogenetics beta version page. Thanks Luke.

Reading a phylogeny from Newick (cont.)

In the last couple of posts, I have discussed a couple of basics regarding the input and storage in memory of phylogenetic trees in R. In particular, in the first post, I described the structure of the {ape} "phylo" object; and in the second, I described a standard algorithm (not of my own invention, but originally conveyed to me by Luke Harmon) for parsing a string in Newick format into a tree consisting of nodes and edges. The ultimate goal is not to read Newick trees into R (which read.tree() does quite well already), but to use this function as the basis for reading in other information from file - such as (for instance) the stochastic character map of a binary or multistate character.

Now our task is to translate the algorithm described in Saturday's post into computer code which will create the structure described in Friday's post. I have done this, and today I posted this function to my R-phylogenetics development page here; however, I remind the reader that this function is really provided for development/experimentation purposes only as it does nothing that is not already implemented in the {ape} function read.tree().

Ok, the first thing we need to do is figure out how many internal nodes and tips we have in the tree. We'd like to do this a priori, because that way we have two important pieces of information before starting: 1) how many rows to put in our matrix $edge; and (more importantly) 2) what number to assign to the root. The way I have done this is by taking one pass through the Newick string and counting the number of commas and the number of right brackets. It is my belief [readers, please correct me if this is wrong!] that the number of commas + 1 will always correspond to the number of tips in the tree; and that the number of right brackets will always correspond to the number of internal nodes.

Once we have figured out how many tips and internal nodes there is, it is straightforward to count the number of edges in the tree. This will just be the sum of the number of internal nodes + the number of tips - 1. The logic for this is as follows: every tip and internal node have an edge leading to them, except for the root node in the tree (hence the -1).

Now with these preliminaries completed, we can proceed to our algorithm for parsing the Newick string. Here, I start with a "while(tree[i]!=";")" loop that will terminate when the end character of our Newick style tree (";") is reached. Inside this loop, we can just program our character-by-character parser to build the tree based on the algorithm of my previous post and animation. I programmed this function more or less as illustrated, with a couple of caveats: 1) I never actually visit terminal nodes (I just create them as daughters, but then never update the status of "current node" to the daughter); and 2) in progressing backwards through the tree in response to character ")", the structure of the "phylo" object prevents easy backward movement through the tree. For this, I used the {base} function match() to match my current node's parent to the row number of prior nodes in the matrix edge.

Once all the elements of the tree are created, I just join them together in a named list using the function list(); and then I assign this list the class "phylo". Finally, I return the list to the R environment - and we're done!

Saturday, February 12, 2011

Building a tree in memory

Ok, as discussed in my last post, our basic goal is to read a Newick style tree [for example, the tree (((Human,Chimp),Gorilla),Monkey);] into computer memory. We would like to do this by parsing the character string, element-by-element, to build a representation of the phylogeny that consists of nodes and edge. Of course this functionality already exists in the form of the {ape} function read.tree(); however, if we can develop a parsing function, we can then adapt this function to simultaneously read in other kinds of information - for instance, the order and time spent in each character in a phylogenetic stochastic character map.

Before showing how this can be done to create the {ape} "phylo" object discussed in the last post, I thought it would be a good idea to present the basic algorithm that is used to parse a Newick tree. I'm not sure if this handy set of rules has ever been published, but it was conveyed to me verbally by Luke Harmon a number of years ago.

The basic algorithm reads the Newick tree "character-by-character" (actually, we will read node labels as if a single character), with the following rule.

Starting with the root node, if we encounter:
1) "(" -> we will create left daughter node;
2) "," -> we will go back to the parent node and create right daughter;
3) ")" -> we will go back to the parent;
4) ";" -> we will finish; and, finally
5) if we encounter a character string [not including special characters listed above], then we label the node with that string.

To be a little clearer, I made a simple animation for the (((Human,Chimp),Gorilla),Monkey); tree from last time. The animation goes through each character of the Newick string at the top of the screen, and creates nodes and edges according the rules above. Enjoy.


Friday, February 11, 2011

Reading trees

In a comment on one of my earlier posts Carl Boettiger remarked that my read.simmap() R function (to read stochastic character mapped phylogenies) destroys the ordering of states along each branch and retains only the time spent in each state on each edge. This ordering is unimportant for analyses based around an undirected Brownian motion model - such as, for instance, the model of O'Meara et al. (2006); however, ordering could be important for other models, such as the Ornstein-Uhlenbeck model (e.g., Butler & King 2004). Carl is a sharp guy and his point is well taken, so now that I have finally settled into my new digs here (not-so-subtle advertisement - grad students welcome!), I decided to try to tackle this problem today.

The first thing I realized was that it would not be practical to use the same "work-around" algorithm that I used for v0.1 of read.simmap() and described here. I was going to have to figure out how to read phylogenies into memory in R without being able to take advantage of the {ape} function read.tree()!

To convey what this problem involves, it is first important to understand how a phylogenetic tree is stored in memory in R by {ape}. read.tree() stores the phylogeny as list of class "phylo" with 3 basic elements, as follows:

$Nnode : this is an integer containing the number of internal nodes in the tree;
$tip.label : this is a vector containing the labels for all the tips in the tree; and finally
$edge : this is a matrix of dimensions ($Nnode+length(tip.label)-1 x 2) which contains the node numbers for all the internal parent (left column) and daughter (right column) nodes in the tree.

By convention in {ape}, terminal nodes are numbered 1:n for n species, where the numbers correspond to the order of the tip labels in $tip.label; and internal nodes are labeled (n+1):($Nnode+length(tip.label)) starting with the root node of the tree.

[Note that this is really different from how phylogeny storage is typically programmed in C; where we would use a network of linked custom data types called "structures. An example of this can be seen on my website, here.]

To give you an idea of what this would look like then, the following tree (see figure) in Newick format:


would have the following elements:

> tree$Nnode
[1] 3
> tree$tip.label
[1] "Human" "Chimp" "Gorilla" "Monkey"
> tree$edge
[,1] [,2]
[1,] 5 6
[2,] 6 7
[3,] 7 1
[4,] 7 2
[5,] 6 3
[6,] 5 4

More on this very soon. . . . .