Today on R-sig-phylo:
"I'm currently wanting to make some changes to some phylogenies in R by reading in the newick text as a string, rather than as a phylo object. The reason is that the trees are large (~10,000 tips and another set with ~5000 tips) and I must make changes to a complete set of these tree (i.e. 10,000 trees). Currently, these trees have node labels, which I'd like to remove.
"Essentially what I'd like to do is substitute ")?:" with "):" or simply delete "?", where "?" is any and all characters that occur between the 'close' parenthesis and the colon. So far I found I could use the function 'sub' but I'd like to make the replacements in one fell swoop, without knowing what the node labels are in advance. Also the sub function seems to only replace the first occurrence of the pattern rather than all matches in a string. Any suggestions would be greatly appreciated!"
Here's one solution using strsplit
:
## generate a Newick string
library(phytools)
tree<-rtree(n=10)
## just so we can look at the Newick string more easily
tree$edge.length<-round(tree$edge.length,2)
tree$node.label<-paste("node",1:9,sep="")
tree
##
## Phylogenetic tree with 10 tips and 9 internal nodes.
##
## Tip labels:
## t9, t3, t1, t7, t2, t6, ...
## Node labels:
## node1, node2, node3, node4, node5, node6, ...
##
## Rooted; includes branch lengths.
text<-write.tree(tree)
text
## [1] "((t9:0.13,t3:0.18)node2:0.13,((t1:0.96,(t7:0.91,t2:0.4)node5:0.88)node4:0.19,(((t6:0.24,(t5:0.03,t8:0.18)node9:0.48)node8:0.13,t10:0.71)node7:0.84,t4:0.7)node6:0.89)node3:0.74)node1;"
strip.nodelabels<-function(text){
obj<-strsplit(text,"")[[1]]
cp<-grep(")",obj)
csc<-c(grep(":",obj),length(obj))
exc<-cbind(cp,sapply(cp,function(x,y) y[which(y>x)[1]],y=csc))
exc<-exc[(exc[,2]-exc[,1])>1,]
inc<-rep(TRUE,length(obj))
if(nrow(exc)>0) for(i in 1:nrow(exc))
inc[(exc[i,1]+1):(exc[i,2]-1)]<-FALSE
paste(obj[inc],collapse="")
}
strip.nodelabels(text)
## [1] "((t9:0.13,t3:0.18):0.13,((t1:0.96,(t7:0.91,t2:0.4):0.88):0.19,(((t6:0.24,(t5:0.03,t8:0.18):0.48):0.13,t10:0.71):0.84,t4:0.7):0.89):0.74);"
It even works fine if some node labels are missing:
tree$node.label[c(2,4,6)]<-""
text<-write.tree(tree)
text
## [1] "((t9:0.13,t3:0.18):0.13,((t1:0.96,(t7:0.91,t2:0.4)node5:0.88):0.19,(((t6:0.24,(t5:0.03,t8:0.18)node9:0.48)node8:0.13,t10:0.71)node7:0.84,t4:0.7):0.89)node3:0.74)node1;"
strip.nodelabels(text)
## [1] "((t9:0.13,t3:0.18):0.13,((t1:0.96,(t7:0.91,t2:0.4):0.88):0.19,(((t6:0.24,(t5:0.03,t8:0.18):0.48):0.13,t10:0.71):0.84,t4:0.7):0.89):0.74);"
We can see how it does for large trees:
tree<-rtree(n=5000)
tree$node.label<-paste("node",1:4999,sep="")
tree
##
## Phylogenetic tree with 5000 tips and 4999 internal nodes.
##
## Tip labels:
## t176, t515, t4374, t3812, t4823, t3087, ...
## Node labels:
## node1, node2, node3, node4, node5, node6, ...
##
## Rooted; includes branch lengths.
text<-write.tree(tree)
system.time(text<-strip.nodelabels(text))
## user system elapsed
## 1.00 0.01 1.11
Not super fast. Compare it to reading the tree, setting node labels to
NULL
, and then writing back to text string:
foo<-function(text){
tree<-read.tree(text=text)
tree$node.label<-NULL
write.tree(tree)
}
text<-write.tree(tree)
system.time(text<-foo(text))
## user system elapsed
## 1.69 0.04 1.82
So, it's about twice as fast or so.
Hi, google has brought me to your package, which I think is cool. While waiting to install Rstudio on a new system I managed to solve the task with sed:
ReplyDeletesed -r 's/\)([.0-9]+):/\):/g' treefile
Sure, but will this work if you have other numeric characters in your Newick string (such as branch lengths or tip labels containing numbers)?
DeleteThis does not work with me. The error mensage says:
ReplyDelete"Error in strsplit(text, "") : non-character argument"
Oh, I forgot to mention that only using "tree$node.label<-NULL" worked fine. Thanks.
ReplyDelete