## Saturday, December 25, 2010

### Reading SIMMAP format trees, part II In my last post I described a new function (presently available from my beta version R phylogenetics page) to read SIMMAP v1.0 format trees into R. Trees in this format have roughly the following form:
((A:{0,0.3:1,0.4:0,0.3},B:{0,1.0}):{0,1.0},(C:{0,0.25:1,0.75},D:{0,1.0}):{0,1.0});
The tree above would correspond to the visual mapping of the figure, right, in which blue branches represent state "0" and red branches state "1."
The most obvious way to code a function to read the tree above into R would be to take Paradis's "ape" function read.tree() and modify it to take the { parenthetical } form of branch lengths. However, the most obvious solution is not always the easiest. What I quickly realized was that it would be much easier to first parse the tree into two (or more) Newick format trees for each state of the binary character. All trees in this case would have the same topology, but their branch lengths would correspond to the time spent in each state on that tree. For instance from the tree above, we would obtain the Newick trees as follows:
"0" : ((A:0.6,B:1.0):1.0,(C:0.25,D:1.0):1.0); and
"1" : ((A:0.4,B:0.0):0.0,(C:0.75,D:0.0):0.0);.
Note that doing this destroys all information about the order of states along edges. This information is unimportant for some analyses (such as O'Meara et al. 2006), but might be critical for others.
We now simply call read.tree() on each of the state trees above to create a "multiPhylo" object which contains a list of the trees for each mapped state. To create our SIMMAP tree in memory, then, we first copy the topology of any of the state trees in our list into our mapped tree (remember, they all have the same topology), and then we create a matrix mapped.edge with the time on each edge (rows) from each state tree (columns). These are just the \$edge.length branch length vectors from each tree. We then add across columns of mapped.edge to get the total length of each edge.
The product here is an object of class "phylo" but with the additional element, our matrix from earlier \$mapped.edge, which contains the times spent in each state on each edge of the tree. And we're done!
I should note that in creating this function I did in fact borrow and modify a chunk of code from the "ape" function read.nexus(). This allowed me to read not only Phylip style tree files, as given above, but also the more complicated Nexus style output files created by SIMMAP.