Monday, July 11, 2011

Writing a "phylo" object to a Newick string

Now that we can generate stochastic character mapped trees using make.simmap(), I thought it would be useful to write them to file. Unfortunately, although I have functions that can read Newick strings from file (e.g., read.newick() and read.simmap()), I don't yet have code to go the other way (i.e., to write a Newick string from a "phylo" object stored in memory). For more details on the Newick convention, check out Felsenstein's comprehensive description.

Ok, now, remember how a "phylo" object is stored. It is a list with three essential components: $edge, a matrix containing the node or tip numbers for the upper and lower node of each edge; $tip.label, a vector containing the labels for each tip; and, finally, $Nnode, an integer value indicating the number of internal nodes in the tree. Branch lengths are stored in the optional vector $edge.length, which contains all the branch lengths in tree tree in the order of $edge.
So, for instance, the Newick tree:
is plotted as follows:

and is represented in memory with the following four parts:
> tree$edge
[,1] [,2]
[1,] 9 10
[2,] 10 11
[3,] 11 1
[4,] 11 12
[5,] 12 13
[6,] 13 2
[7,] 13 3
[8,] 12 14
[9,] 14 4
[10,] 14 5
[11,] 10 6
[12,] 9 15
[13,] 15 7
[14,] 15 8
> tree$tip.label
[1] "2" "6" "7" "4" "5" "1" "3" "8"
> tree$Nnode
[1] 7
> tree$edge.length
[1] 0.17 0.24 1.55 0.92 0.47 0.16 0.16 0.41 0.22 0.22 1.79 1.47 0.49 0.49

The task of back-translating the object above to a Newick string is not too difficult. In fact, if the "phylo" object is in "cladewise" order (as above), then we can nearly just descend from the top to the bottom of the matrix $edge, writing the appropriate characters at each step. The reason I qualify this as nearly is because we can move up (i.e., from root to tip) the tree very easily by descending $edge, we do not as easily move back down the tree this way - which we need to do to close parentheses and assign branch lengths to internal edges of the Newick tree. To do this, we just need to write a simple while() loop which backs down the tree every time we might be ready to close out a clade.

The code for my simple tree writer is not particularly elegant, but it seems to work:

  string<-vector(); string[1]<-"("; j<-2
  for(i in 1:nrow(tree$edge)){
      string[j]<-tree$tip.label[tree$edge[i,2]]; j<-j+1
        string[j]<-paste(c(":",round(tree$edge.length[i],10)), collapse="")
      v<-which(tree$edge[,1]==tree$edge[i,1]); k<-i
        string[j]<-")"; j<-j+1
          string[j]<-paste(c(":",round(tree$edge.length[w],10)), collapse="")
        v<-which(tree$edge[,1]==tree$edge[w,1]); k<-w
      string[j]<-","; j<-j+1
    } else if(tree$edge[i,2]>=n){
      string[j]<-"("; j<-j+1
  if(is.null(tree$edge.length)) string<-c(string[1:(length(string)-1)], ";")
  else string<-c(string[1:(length(string)-2)],";")

Now, let's test it out against "ape"'s write.tree() for a random tree:
> tree<-rtree(n=100)
> tree$edge.length<-round(tree$edge.length,6) # to avoid precision differences
> write.tree(tree)==writeTree(tree)
[1] TRUE

Cool! Now we're ready to handle printing "simmap" trees.


  1. Hi! This is exactly what I was looking for!! I have a problem though.I get that the nrow of $edge are the $lengths, but I cannot visualize which numbers of the edge matrix correspond to the tree.

    Can you help me?


  2. whoah this blog is fantastic i really like reading your posts. Keep up the great paintings! You already know, lots of persons are searching around for this information, you could aid them greatly.
    Click on Cloud IT Zone and get your hosting Services.