Monday, August 24, 2015

Removing node labels from a Newick string

Today on R-sig-phylo:

"I'm currently wanting to make some changes to some phylogenies in R by reading in the newick text as a string, rather than as a phylo object. The reason is that the trees are large (~10,000 tips and another set with ~5000 tips) and I must make changes to a complete set of these tree (i.e. 10,000 trees). Currently, these trees have node labels, which I'd like to remove.

"Essentially what I'd like to do is substitute ")?:" with "):" or simply delete "?", where "?" is any and all characters that occur between the 'close' parenthesis and the colon. So far I found I could use the function 'sub' but I'd like to make the replacements in one fell swoop, without knowing what the node labels are in advance. Also the sub function seems to only replace the first occurrence of the pattern rather than all matches in a string. Any suggestions would be greatly appreciated!"

Here's one solution using strsplit:

## generate a Newick string
library(phytools)
tree<-rtree(n=10)
## just so we can look at the Newick string more easily
tree$edge.length<-round(tree$edge.length,2)
tree$node.label<-paste("node",1:9,sep="")
tree
## 
## Phylogenetic tree with 10 tips and 9 internal nodes.
## 
## Tip labels:
##  t9, t3, t1, t7, t2, t6, ...
## Node labels:
##  node1, node2, node3, node4, node5, node6, ...
## 
## Rooted; includes branch lengths.
text<-write.tree(tree)
text
## [1] "((t9:0.13,t3:0.18)node2:0.13,((t1:0.96,(t7:0.91,t2:0.4)node5:0.88)node4:0.19,(((t6:0.24,(t5:0.03,t8:0.18)node9:0.48)node8:0.13,t10:0.71)node7:0.84,t4:0.7)node6:0.89)node3:0.74)node1;"
strip.nodelabels<-function(text){
    obj<-strsplit(text,"")[[1]]
    cp<-grep(")",obj)
    csc<-c(grep(":",obj),length(obj))
    exc<-cbind(cp,sapply(cp,function(x,y) y[which(y>x)[1]],y=csc))
    exc<-exc[(exc[,2]-exc[,1])>1,]
    inc<-rep(TRUE,length(obj))
    if(nrow(exc)>0) for(i in 1:nrow(exc)) 
        inc[(exc[i,1]+1):(exc[i,2]-1)]<-FALSE
    paste(obj[inc],collapse="")
}
strip.nodelabels(text)
## [1] "((t9:0.13,t3:0.18):0.13,((t1:0.96,(t7:0.91,t2:0.4):0.88):0.19,(((t6:0.24,(t5:0.03,t8:0.18):0.48):0.13,t10:0.71):0.84,t4:0.7):0.89):0.74);"

It even works fine if some node labels are missing:

tree$node.label[c(2,4,6)]<-""
text<-write.tree(tree)
text
## [1] "((t9:0.13,t3:0.18):0.13,((t1:0.96,(t7:0.91,t2:0.4)node5:0.88):0.19,(((t6:0.24,(t5:0.03,t8:0.18)node9:0.48)node8:0.13,t10:0.71)node7:0.84,t4:0.7):0.89)node3:0.74)node1;"
strip.nodelabels(text)
## [1] "((t9:0.13,t3:0.18):0.13,((t1:0.96,(t7:0.91,t2:0.4):0.88):0.19,(((t6:0.24,(t5:0.03,t8:0.18):0.48):0.13,t10:0.71):0.84,t4:0.7):0.89):0.74);"

We can see how it does for large trees:

tree<-rtree(n=5000)
tree$node.label<-paste("node",1:4999,sep="")
tree
## 
## Phylogenetic tree with 5000 tips and 4999 internal nodes.
## 
## Tip labels:
##  t176, t515, t4374, t3812, t4823, t3087, ...
## Node labels:
##  node1, node2, node3, node4, node5, node6, ...
## 
## Rooted; includes branch lengths.
text<-write.tree(tree)
system.time(text<-strip.nodelabels(text))
##    user  system elapsed 
##    1.00    0.01    1.11

Not super fast. Compare it to reading the tree, setting node labels to NULL, and then writing back to text string:

foo<-function(text){
    tree<-read.tree(text=text)
    tree$node.label<-NULL
    write.tree(tree)
}
text<-write.tree(tree)
system.time(text<-foo(text))
##    user  system elapsed 
##    1.69    0.04    1.82

So, it's about twice as fast or so.

No comments:

Post a Comment