## Friday, February 11, 2011

In a comment on one of my earlier posts Carl Boettiger remarked that my read.simmap() R function (to read stochastic character mapped phylogenies) destroys the ordering of states along each branch and retains only the time spent in each state on each edge. This ordering is unimportant for analyses based around an undirected Brownian motion model - such as, for instance, the model of O'Meara et al. (2006); however, ordering could be important for other models, such as the Ornstein-Uhlenbeck model (e.g., Butler & King 2004). Carl is a sharp guy and his point is well taken, so now that I have finally settled into my new digs here (not-so-subtle advertisement - grad students welcome!), I decided to try to tackle this problem today.

The first thing I realized was that it would not be practical to use the same "work-around" algorithm that I used for v0.1 of read.simmap() and described here. I was going to have to figure out how to read phylogenies into memory in R without being able to take advantage of the {ape} function read.tree()!

To convey what this problem involves, it is first important to understand how a phylogenetic tree is stored in memory in R by {ape}. read.tree() stores the phylogeny as list of class "phylo" with 3 basic elements, as follows:

\$Nnode : this is an integer containing the number of internal nodes in the tree;
\$tip.label : this is a vector containing the labels for all the tips in the tree; and finally
\$edge : this is a matrix of dimensions (\$Nnode+length(tip.label)-1 x 2) which contains the node numbers for all the internal parent (left column) and daughter (right column) nodes in the tree.

By convention in {ape}, terminal nodes are numbered 1:n for n species, where the numbers correspond to the order of the tip labels in \$tip.label; and internal nodes are labeled (n+1):(\$Nnode+length(tip.label)) starting with the root node of the tree.

[Note that this is really different from how phylogeny storage is typically programmed in C; where we would use a network of linked custom data types called "structures. An example of this can be seen on my website, here.]

To give you an idea of what this would look like then, the following tree (see figure) in Newick format:

would have the following elements:

> tree\$Nnode
[1] 3
> tree\$tip.label
[1] "Human" "Chimp" "Gorilla" "Monkey"
> tree\$edge
[,1] [,2]
[1,] 5 6
[2,] 6 7
[3,] 7 1
[4,] 7 2
[5,] 6 3
[6,] 5 4

More on this very soon. . . . .