Friday, December 24, 2010

Reading SIMMAP trees into R

Stochastic character mapping, originally due to Nielsen (2001, 2002) and Huelsenbeck et al. (2003) is a procedure whereby mutations or character substitutions are mapped onto the nodes and branches of a phylogenetic tree. Maps are generated stochastically, but with probability equal to their posterior probability given our transition model. Sampling stochastic character maps, then, may be a good way to capture information about both the states at internal positions in the tree, as well as the timing of character substitution through the phylogeny. As noted in an earlier post, the background of this blog are visualized stochastic character maps for "ecomorph" mapped on the Caribbean Anolis tree (from Mahler et al. 2010).

The program SIMMAP for stochastic mutational mapping was developed by Jonathan Bollback. It performs stochastic mutation mapping, and outputs stochastically mapped trees to file using a modified Newick format. Since stochastic character maps can be very useful for certain kinds of downstream phylogenetic analysis, I was very interested in writing a function that would read these modified Newick trees into R.

The basic format of stochastic character mapped phylogenies in v1.0 of SIMMAP is as follows:

((A:{0,0.3:1,0.4:0,0.3},B:{0,1.0}):{0,1.0},(C:{0,0.25:1,0.75},D:{0,1.0}):{0,1.0});

This is comparable to the standard Newick style tree except that branch lengths have been replaced by { curly brackets } according to the following convention: { state i, time spent in state i : state j, time spent in state j : etc. }.

I'm happy to announce that I have now developed usable code to read SIMMAP v1.0 trees in R. This function, read.simmap(), accepts either or Phylip or Nexus style input files (the latter courtesy in large part to code borrowed from the "ape" function read.tree()). My function creates a modified "phylo" object, with the additional element \$mapped.edge which is a matrix containing the time spent in each state (columns) on each edge (rows) of the tree. The source code for this function is available as a beta test version from my R phylogenetics page, here, along with some very basic description of use.

Unfortunately, this announcement comes with a hitch. In the most recent release of SIMMAP (v1.5) Bollback has updated the SIMMAP tree format so that the tree above would be represented as follows:

((A:[&map={0,0.3,1,0.4,0,0.3}]1.0,B:[&map={0}]1.0):[&map={0}]1.0,(C:[&map={0,0.25,1,0.75}]1.0,D:[&map={0}]1.0):[&map={0}]1.0);

The reason for this, I surmise, is that now these trees conform to the Newick standard (in which, by convention, text inside [ hard brackets ] is ignored) and thus can be read by programs not specifically designed to read SIMMAP style trees. Unfortunately, Bollback does not provide any means of converting between old and new formatted trees.

I'm working on reading the new SIMMAP format into R as an option in my read.simmap() function. I will keep readers of this blog up to date on my progress on this problem. I will also work on adding the capacity to retain the order of states on each edge, as the present version of my function destroys this information during read-in (retaining only the total time spent in each state on each edge).

Please feel free to test my function with SIMMAP v1.o formatted trees and let me know what you find.

1. Thanks for sharing! Is your R function just reporting the total time spent in each state, rather than the number/position of transitions? (for instance, in the example text<-"((A:{aqua,0.3:terr,0.4:aqua,0.3},B:{aqua,1.0}):{aqua,1.0},(C:{aqua,0.25:terr,0.75},D:{aqua,1.0}):{aqua,1.0});"

the mapped.edge just tells me that branch for species A is 60% aqua and 40% terr, doesn't tell me that two transitions occurred: aqua 0.3, terr 0.4, aqua 0.3. Did I miss something?

While it wouldn't make a difference for Brownian model, it would in general be useful to preserve the whole mapping. It would allow someone to write a nice plotting function, and it could be used in OU models, etc, where it would matter.

2. Hi Carl. Yes, you're right - the function does destroy the ordering of the states on each edge, as noted in my next post. The reason for doing this was to create an object that would work with a couple of my other functions, for which the ordering is not important. Retaining that information would be valuable for some analyses and for plotting, as you say. I will try to do this soon.

3. Hi Liam, I have tried the following code:
However, it gives me this error:
Error in FUN("((A:{0,0.3:1,0.4:0,0.3},B:{0,1.0}):{0,1.0},(C:{0,0.25:1,0.75},D:{0,1.0}):{0,1.0});"[[1L]], :
Any idea how to solve this?

1. I could resolve this by setting trans = NULL, but I think this should be resolved in the function itself.

2. Yup - this (not being able to read from a character string supplied through the argument text) is a bug that I seem to have introduced at some point. I have now fixed it - more details here. Thanks for the bug report.

BTW, I suspect based on your tree you also need to set rev.order=TRUE - the default is FALSE (if I remember correctly) based on the output of the program SIMMAP.

4. One other thing: when producing a tree with a stem, e.g.

fftree = read.simmap(text = "((S1:{0,1:1,1},S2:{0,1:1,1}):{0,1:1,1});")

read.simmap gives no error, but trying to plot it with

plotSimmap(fftree)

I get

Error in reorder.phylo(tree, order) : tree apparently badly conformed

1. Hmmm. Yes, it is not designed to read a root edge. I'm surprised it seems to read the tree at all. I think if you removed the outer parentheses (which are not required with a stem edge under the Newick standard, although they are permitted) it would fail completely. I will try to add this feature to a future version. It would not be too hard.

5. Dear Liam,
I'm not familiar with R (I’m just starting to explore what I can do with it!) therefore my questions may be very basic. I have some problems with the phylo and multiphylo objects in phytools.
1) I want to make a stochastic character mapping on a multi-phylo object, using the function make.simmap. The multi-phylo object I’m using is not a modified phylo object (I never used the program SIMMAP, the trees I’m using are simple ultrametric trees obtained with BEAST). Before applying the make.simmap I must prune the trees (I lack data for some terminals). Therefore I created an empty list where I saved the pruned trees. But make.simmap does not read this newly created list, the following error pops up: Error in make.simmap(arboles.nuevos, syn, model = "ER", nsim = 1000) :
'tree' should be an object of class 'phylo'
How can I do for applying the make.simmap function to a set of pruned trees?
Once I did this, how can I access the elements in these trees? (e.g. estimated transition rates)
2) Using the maximum credibility tree I obtained with BEAST, I want to add pie charts to the nodes indicating the posterior probability of different adaptive strategies (as the ones you show in the examples of your chapter “Graphical Methods for Visualizing
Comparative Data on Phylogenies” in the book Modern Phylogenetic Comparative Methods and their Application to Evolutionary Biology. I was able to run the make.simmap function on this tree, but I cannot access the state of the nodes at each iteration (I need to access this in order to prepare the pie charts). How can I access this information?
3) I want to use the map.overlap function to test for the overlap of the stochastic evolutionary history of two traits. I used the objects I created with the make.simmap function (e.g. history.t.1<-make.simmap();history.t.2<-make.simmap()); but an error message tells me that Error in is.rooted(target) : object "phy" is not of class "phylo".
I don’t understand why this happens. Aren’t history.t.1 and history.t.2 modified “phylo” objects? I would really appreciate if you can tell me how the phylo objects I input in the map.overlap function can be obtained.

Thanks in advance for answering these questions,

Marina

1. Hi Marina.

If I understand correctly, you have a set of trees read from file and you want to prune that set by removing the same set of taxa from each of the trees? This is easy, just do:

tips ## list of tips to prune
trees<-lapply(trees,drop.tip,tips)
class(trees)<-"multiPhylo"

Now you want to generate a stochastic character map, or set of maps, for each one of those trees under a model. For this, you can do the following:

x ## data for the discrete character in a vector with names
mtrees<-make.simmap(trees,x,model="ER",nsim=10)

This will have the effect of generating 10 stochastic mapped trees for each phylogeny in trees. So, for instance, if you have 100 trees, and use make.simmap with nsim=10, then you will end up with a sample of 1,000 stochastic mapped trees in an object of class "multiPhylo".

Let me know if this helps. All the best, Liam

6. Hello Dr. Revell, I'm trying to use read/write simmap so that I can create many, large maps on a remote server and save them for analyses on my personal computer, but I'm having trouble getting the functions to work.

I've cooked up a test case below:

set.seed(1) # Set seed for reproducibility
# Create PB tree with random discrete traits
pb.tree = pbtree(b=1, d=0.2, n=40)
pb.traits = as.matrix(sample(x = c(0, 1, 2, 3), size = length(pb.tree\$tip.label), replace = TRUE))
rownames(pb.traits) = pb.tree\$tip.label

# Create simmap
pb.simmap = make.simmap(pb.tree, pb.traits[, 1], model="ARD")

# Plot it to make sure it went as planned
cols = setNames(c("#029386", "#7570b3", "#d95f02", "#cb0162"), sort(unique(pb.traits[, 1])))
plot(pb.simmap, cols, type = "phylogram", fsize=0.5, ftype="i")
tiplabels(pie = to.matrix(pb.traits, sort(unique(pb.traits))), piecol = cols, cex=0.3)
add.simmap.legend(colors = cols, prompt = FALSE, x = 0.5*par()\$usr[1], y = 2*max(nodeHeights(pb.simmap)), fsize=1)

# Export simmap using the original tree and the mapped edges
export.as.xml(file = '~/Downloads/pbtree.xml', trees = pb.tree, X = pb.simmap\$mapped.edge)
# Import simmap from above