Sunday, August 30, 2015

New S3 plot method for rateshift

I just added a new S3 plotting method for objects of class "rateshift" created by the phytools function rateshift.

This function attempts to fit different Brownian rates of evolutionary change to a user specified number of different 'eras' across the history of the user tree, without specifying when each era should begin & end.

The method is quite simple - and probably interesting to some users - but as I have never published an article describing or using this approach, it probably ranks among the neglected methods of the phytools package.

The new plot method visualizes the different fitted eras on the tree, and includes a color gradient for the fitted rates.

Here's a quick (or, not so quick, if you try to run it) example of what I mean:

library(devtools)
install_github("liamrevell/phytools",quiet=TRUE)
library(phytools)

First, let's simulate some data with strong rate shifts over time:

set.seed(99)
tree<-pbtree(n=50,scale=100)
x<-sim.rates(tree<-make.era.map(tree,c(0,60,85)),
setNames(c(5,20,1),1:3))
## these are our simulated regimes:
plot(tree,setNames(c("purple","red","blue"),1:3),fsize=0.8,ftype="i",
mar=c(3.1,0.1,0.1,0.1))
axis(1) Now let's fit our different models:

## one rate model (the first two models are easy to fit)
fit1<-rateshift(tree,x,niter=1)
## Optimizing. Please wait.
fit1
## ML 1-rate model:
##  s^2(1)  se(1)   k   logL
## value    7.4779  1.4956  2   -206.1193
##
## This is a one-rate model.
##
## R thinks it has found the ML solution.
plot(fit1,fsize=0.7,ftype="i",mar=c(3.1,0.1,0.1,0.1),lwd=3)
axis(1) ## two rate model
fit2<-rateshift(tree,x,nrates=2,niter=1)
## Optimizing. Please wait.
fit2
## ML 2-rate model:
##  s^2(1)  se(1)   s^2(2)  se(2)   k   logL
## value    13.8709 3.3576  0.526   0.2121  4   -193.311
##
## Shift point(s) between regimes (height above root):
##  1|2 se(1|2)
## value    88.4271 0.0436
##
## R thinks it has found the ML solution.
plot(fit2,fsize=0.7,ftype="i",mar=c(3.1,0.1,0.1,0.1),lwd=3)
axis(1) ## three rate model (this was our generating model)
fit3<-rateshift(tree,x,nrates=3,niter=50)
## Optimization progress:
## |..................................................|
## Done.
fit3
## ML 3-rate model:
##  s^2(1)  se(1)   s^2(2)  se(2)   s^2(3)  se(3)   k   logL
## value    3e-04   NaN 17.1992 5.3099  0.5412  0.2269  6   -190.4806
##
## Shift point(s) between regimes (height above root):
##  1|2 se(1|2) 2|3 se(2|3)
## value    48.8427 11.3994 88.4255 0.0819
##
## Optimization may not have converged.
plot(fit3,fsize=0.7,ftype="i",mar=c(3.1,0.1,0.1,0.1),lwd=3)
axis(1) ## four rate model
fit4<-rateshift(tree,x,nrates=4,niter=50)
## Optimization progress:
## |..................................................|
## Done.
fit4
## ML 4-rate model:
##  s^2(1)  se(1)   s^2(2)  se(2)   s^2(3)  se(3)   s^2(4)  se(4)   k   logL
## value    0.0021  NaN 29.1904 12.3032 6.4966  4.7628  0.5255  0.209   8   -189.4281
##
## Shift point(s) between regimes (height above root):
##  1|2 se(1|2) 2|3 se(2|3) 3|4 se(3|4)
## value    53.5568 6.0482  74.1195 0.2341  89.9531 3.9633
##
## Optimization may not have converged.
plot(fit4,fsize=0.7,ftype="i",mar=c(3.1,0.1,0.1,0.1),lwd=3)
axis(1) The parameter estimates for the three-rate model are quite close - at least in terms of the estimated shift points - to their generating values, which is cool.

We can also compute & compare AIC scores for each model:

aics<-setNames(c(AIC(fit1),AIC(fit2),AIC(fit3),AIC(fit4)),
paste(1:4,"-rate",sep=""))
aics
##   1-rate   2-rate   3-rate   4-rate
## 416.2387 394.6220 392.9611 394.8562

We see that, in addition to having estimated shift points close to the simulated values, the generating, three-rate model has the best support - although it seems to be quite difficult to achieve convergence for the models with more shift point…. Since the S3 methods are so nice, maybe I should work a little harder at getting the optimization to actually work well!

Saturday, August 29, 2015

Adding a tip to a sister taxon on the tree

Here's a quick primer on how to attach a tip not in our tree to it's putative sister taxon half-way along the terminal edge length to its sister taxon.

## load phytools & simulate
library(phytools)
tree<-pbtree(n=26,tip.label=LETTERS)
plotTree(tree) ## add tip "A2" to sister taxon "A"
tip<-"A2"
sister<-"A"
tree<-bind.tip(tree,tip,where=which(tree\$tip.label==sister),
position=0.5*tree\$edge.length[which(tree\$edge[,2]==
which(tree\$tip.label==sister))])
## add tip "T2" to "T"
tip<-"T2"
sister<-"T"
tree<-bind.tip(tree,tip,where=which(tree\$tip.label==sister),
position=0.5*tree\$edge.length[which(tree\$edge[,2]==
which(tree\$tip.label==sister))])
plotTree(tree) This works with ultrametric trees. If our trees are not ultrametric, then we also have to include the terminal edge length that we want to have for our tip to be added in each case.

OK, we've done it!

Monday, August 24, 2015

Removing node labels from a Newick string

Today on R-sig-phylo:

"I'm currently wanting to make some changes to some phylogenies in R by reading in the newick text as a string, rather than as a phylo object. The reason is that the trees are large (~10,000 tips and another set with ~5000 tips) and I must make changes to a complete set of these tree (i.e. 10,000 trees). Currently, these trees have node labels, which I'd like to remove.

"Essentially what I'd like to do is substitute ")?:" with "):" or simply delete "?", where "?" is any and all characters that occur between the 'close' parenthesis and the colon. So far I found I could use the function 'sub' but I'd like to make the replacements in one fell swoop, without knowing what the node labels are in advance. Also the sub function seems to only replace the first occurrence of the pattern rather than all matches in a string. Any suggestions would be greatly appreciated!"

Here's one solution using strsplit:

## generate a Newick string
library(phytools)
tree<-rtree(n=10)
## just so we can look at the Newick string more easily
tree\$edge.length<-round(tree\$edge.length,2)
tree\$node.label<-paste("node",1:9,sep="")
tree
##
## Phylogenetic tree with 10 tips and 9 internal nodes.
##
## Tip labels:
##  t9, t3, t1, t7, t2, t6, ...
## Node labels:
##  node1, node2, node3, node4, node5, node6, ...
##
## Rooted; includes branch lengths.
text<-write.tree(tree)
text
##  "((t9:0.13,t3:0.18)node2:0.13,((t1:0.96,(t7:0.91,t2:0.4)node5:0.88)node4:0.19,(((t6:0.24,(t5:0.03,t8:0.18)node9:0.48)node8:0.13,t10:0.71)node7:0.84,t4:0.7)node6:0.89)node3:0.74)node1;"
strip.nodelabels<-function(text){
obj<-strsplit(text,"")[]
cp<-grep(")",obj)
csc<-c(grep(":",obj),length(obj))
exc<-cbind(cp,sapply(cp,function(x,y) y[which(y>x)],y=csc))
exc<-exc[(exc[,2]-exc[,1])>1,]
inc<-rep(TRUE,length(obj))
if(nrow(exc)>0) for(i in 1:nrow(exc))
inc[(exc[i,1]+1):(exc[i,2]-1)]<-FALSE
paste(obj[inc],collapse="")
}
strip.nodelabels(text)
##  "((t9:0.13,t3:0.18):0.13,((t1:0.96,(t7:0.91,t2:0.4):0.88):0.19,(((t6:0.24,(t5:0.03,t8:0.18):0.48):0.13,t10:0.71):0.84,t4:0.7):0.89):0.74);"

It even works fine if some node labels are missing:

tree\$node.label[c(2,4,6)]<-""
text<-write.tree(tree)
text
##  "((t9:0.13,t3:0.18):0.13,((t1:0.96,(t7:0.91,t2:0.4)node5:0.88):0.19,(((t6:0.24,(t5:0.03,t8:0.18)node9:0.48)node8:0.13,t10:0.71)node7:0.84,t4:0.7):0.89)node3:0.74)node1;"
strip.nodelabels(text)
##  "((t9:0.13,t3:0.18):0.13,((t1:0.96,(t7:0.91,t2:0.4):0.88):0.19,(((t6:0.24,(t5:0.03,t8:0.18):0.48):0.13,t10:0.71):0.84,t4:0.7):0.89):0.74);"

We can see how it does for large trees:

tree<-rtree(n=5000)
tree\$node.label<-paste("node",1:4999,sep="")
tree
##
## Phylogenetic tree with 5000 tips and 4999 internal nodes.
##
## Tip labels:
##  t176, t515, t4374, t3812, t4823, t3087, ...
## Node labels:
##  node1, node2, node3, node4, node5, node6, ...
##
## Rooted; includes branch lengths.
text<-write.tree(tree)
system.time(text<-strip.nodelabels(text))
##    user  system elapsed
##    1.00    0.01    1.11

Not super fast. Compare it to reading the tree, setting node labels to NULL, and then writing back to text string:

foo<-function(text){