Monday, August 24, 2015

Removing node labels from a Newick string

Today on R-sig-phylo:

"I'm currently wanting to make some changes to some phylogenies in R by reading in the newick text as a string, rather than as a phylo object. The reason is that the trees are large (~10,000 tips and another set with ~5000 tips) and I must make changes to a complete set of these tree (i.e. 10,000 trees). Currently, these trees have node labels, which I'd like to remove.

"Essentially what I'd like to do is substitute ")?:" with "):" or simply delete "?", where "?" is any and all characters that occur between the 'close' parenthesis and the colon. So far I found I could use the function 'sub' but I'd like to make the replacements in one fell swoop, without knowing what the node labels are in advance. Also the sub function seems to only replace the first occurrence of the pattern rather than all matches in a string. Any suggestions would be greatly appreciated!"

Here's one solution using strsplit:

## generate a Newick string
library(phytools)
tree<-rtree(n=10)
## just so we can look at the Newick string more easily
tree$edge.length<-round(tree$edge.length,2)
tree$node.label<-paste("node",1:9,sep="")
tree
## 
## Phylogenetic tree with 10 tips and 9 internal nodes.
## 
## Tip labels:
##  t9, t3, t1, t7, t2, t6, ...
## Node labels:
##  node1, node2, node3, node4, node5, node6, ...
## 
## Rooted; includes branch lengths.
text<-write.tree(tree)
text
## [1] "((t9:0.13,t3:0.18)node2:0.13,((t1:0.96,(t7:0.91,t2:0.4)node5:0.88)node4:0.19,(((t6:0.24,(t5:0.03,t8:0.18)node9:0.48)node8:0.13,t10:0.71)node7:0.84,t4:0.7)node6:0.89)node3:0.74)node1;"
strip.nodelabels<-function(text){
    obj<-strsplit(text,"")[[1]]
    cp<-grep(")",obj)
    csc<-c(grep(":",obj),length(obj))
    exc<-cbind(cp,sapply(cp,function(x,y) y[which(y>x)[1]],y=csc))
    exc<-exc[(exc[,2]-exc[,1])>1,]
    inc<-rep(TRUE,length(obj))
    if(nrow(exc)>0) for(i in 1:nrow(exc)) 
        inc[(exc[i,1]+1):(exc[i,2]-1)]<-FALSE
    paste(obj[inc],collapse="")
}
strip.nodelabels(text)
## [1] "((t9:0.13,t3:0.18):0.13,((t1:0.96,(t7:0.91,t2:0.4):0.88):0.19,(((t6:0.24,(t5:0.03,t8:0.18):0.48):0.13,t10:0.71):0.84,t4:0.7):0.89):0.74);"

It even works fine if some node labels are missing:

tree$node.label[c(2,4,6)]<-""
text<-write.tree(tree)
text
## [1] "((t9:0.13,t3:0.18):0.13,((t1:0.96,(t7:0.91,t2:0.4)node5:0.88):0.19,(((t6:0.24,(t5:0.03,t8:0.18)node9:0.48)node8:0.13,t10:0.71)node7:0.84,t4:0.7):0.89)node3:0.74)node1;"
strip.nodelabels(text)
## [1] "((t9:0.13,t3:0.18):0.13,((t1:0.96,(t7:0.91,t2:0.4):0.88):0.19,(((t6:0.24,(t5:0.03,t8:0.18):0.48):0.13,t10:0.71):0.84,t4:0.7):0.89):0.74);"

We can see how it does for large trees:

tree<-rtree(n=5000)
tree$node.label<-paste("node",1:4999,sep="")
tree
## 
## Phylogenetic tree with 5000 tips and 4999 internal nodes.
## 
## Tip labels:
##  t176, t515, t4374, t3812, t4823, t3087, ...
## Node labels:
##  node1, node2, node3, node4, node5, node6, ...
## 
## Rooted; includes branch lengths.
text<-write.tree(tree)
system.time(text<-strip.nodelabels(text))
##    user  system elapsed 
##    1.00    0.01    1.11

Not super fast. Compare it to reading the tree, setting node labels to NULL, and then writing back to text string:

foo<-function(text){
    tree<-read.tree(text=text)
    tree$node.label<-NULL
    write.tree(tree)
}
text<-write.tree(tree)
system.time(text<-foo(text))
##    user  system elapsed 
##    1.69    0.04    1.82

So, it's about twice as fast or so.

Friday, August 21, 2015

densityMap with non-binary trait data

I (not-so) recently go the following question about the phytools function densityMap:

Is it possible to make same as your method 1 'Visualizing the aggregate result of stochastic mapping, in your paper (2013) : “Two new graphical methods for mapping trait evolution on phylogenies”, for a discrete trait which is not a binary trait?

Presently densityMap, which creates a visualization of the posterior probability density from stochastic mapping of a binary character on the tree, does not work for any more than two characters. This is mostly because it is hard to envision visualizing color in more than one dimension! (One possibly for a three-state character would be to put the probabilities on a 3D RGB scale - but this adds only one dimension to the analysis anyway.)

The only solution right now is to use mergeMappedStates on the "simmap" trees & then plot "A" vs. "not A", "B" vs. "not B", "C" vs. "not C", etc.

Here is a quick demo of how to do that, which would have to be varied depending on the specific character states that were being modeled:

library(phytools)
packageVersion("phytools")
## [1] '0.4.99'
## simulate some data
Q<-matrix(c(-1,0.5,0.5,
    0.5,-1,0.5,
    0.5,0.5,-1),3,3)
rownames(Q)<-colnames(Q)<-LETTERS[1:3]
## true history
tree<-sim.history(pbtree(n=50,scale=1),Q,anc="A")
## Done simulation(s).
tree
## 
## Phylogenetic tree with 50 tips and 49 internal nodes.
## 
## Tip labels:
##  t32, t33, t23, t13, t14, t7, ...
## 
## The tree includes a mapped, 3-state discrete character with states:
##  A, B, C
## 
## Rooted; includes branch lengths.
plot(tree,fsize=0.8)
## no colors provided. using the following legend:
##        A        B        C 
##  "black"    "red" "green3"

plot of chunk unnamed-chunk-1

## simulated tip data
x<-getStates(tree,"tips")
trees<-make.simmap(tree,x,nsim=100,model="ER")
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
##            A          B          C
## A -0.8584049  0.4292025  0.4292025
## B  0.4292025 -0.8584049  0.4292025
## C  0.4292025  0.4292025 -0.8584049
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##         A         B         C 
## 0.3333333 0.3333333 0.3333333
## Done.
trees
## 100 phylogenetic trees with mapped discrete characters
print(trees[1:10],details=TRUE)
## 10 phylogenetic trees with mapped discrete characters
## tree 1 : 50 tips, 3 mapped states
## tree 2 : 50 tips, 3 mapped states
## tree 3 : 50 tips, 3 mapped states
## tree 4 : 50 tips, 3 mapped states
## tree 5 : 50 tips, 3 mapped states
## tree 6 : 50 tips, 3 mapped states
## tree 7 : 50 tips, 3 mapped states
## tree 8 : 50 tips, 3 mapped states
## tree 9 : 50 tips, 3 mapped states
## tree 10 : 50 tips, 3 mapped states
## merge B & C
tBC<-lapply(trees,mergeMappedStates,c("B","C"),"B or C")
class(tBC)<-c("multiSimmap","multiPhylo")
print(tBC[1:10],details=TRUE)
## 10 phylogenetic trees with mapped discrete characters
## tree 1 : 50 tips, 2 mapped states
## tree 2 : 50 tips, 2 mapped states
## tree 3 : 50 tips, 2 mapped states
## tree 4 : 50 tips, 2 mapped states
## tree 5 : 50 tips, 2 mapped states
## tree 6 : 50 tips, 2 mapped states
## tree 7 : 50 tips, 2 mapped states
## tree 8 : 50 tips, 2 mapped states
## tree 9 : 50 tips, 2 mapped states
## tree 10 : 50 tips, 2 mapped states
obj<-densityMap(tBC,plot=FALSE,states=c("B or C","A"))
## sorry - this might take a while; please be patient
obj
## Object of class "densityMap" containing:
## 
## (1) A phylogenetic tree with 50 tips and 49 internal nodes.
## 
## (2) The mapped posterior density of a discrete binary character with states (B or C, A).
plot(obj,outline=TRUE,lwd=4,fsize=c(0.7,1))

plot of chunk unnamed-chunk-2

Providing the argument states=c("B or C","A") is an important undocumented detail here, because otherwise branches of high posterior probability would be high posterior probability of "B or C"!

Similarly:

tAC<-lapply(trees,mergeMappedStates,c("A","C"),"A or C")
class(tAC)<-c("multiSimmap","multiPhylo")
obj<-densityMap(tAC,plot=FALSE,states=c("A or C","B"))
## sorry - this might take a while; please be patient
obj
## Object of class "densityMap" containing:
## 
## (1) A phylogenetic tree with 50 tips and 49 internal nodes.
## 
## (2) The mapped posterior density of a discrete binary character with states (A or C, B).
plot(obj,outline=TRUE,lwd=4,fsize=c(0.7,1))

plot of chunk unnamed-chunk-3

tAB<-lapply(trees,mergeMappedStates,c("A","B"),"A or B")
class(tAB)<-c("multiSimmap","multiPhylo")
obj<-densityMap(tAB,plot=FALSE,states=c("A or B","C"))
## sorry - this might take a while; please be patient
obj
## Object of class "densityMap" containing:
## 
## (1) A phylogenetic tree with 50 tips and 49 internal nodes.
## 
## (2) The mapped posterior density of a discrete binary character with states (A or B, C).
plot(obj,outline=TRUE,lwd=4,fsize=c(0.7,1))

plot of chunk unnamed-chunk-3

Thanks for the question!

Thursday, August 20, 2015

Adding node labels to a phylomorphospace plot

I just posted an update to the phytools function phylomorphospace which now sets the environmental variable "last_plot.phylo" so that phylomorphospace can now be more easily combined with ape functions such as nodelabels and tiplabels.

For instance, imagine we have a discrete character that we want to overlay on a tree projected into morphospace using the function phylomorphospace. At present, we can use stochastic mapping style trees of class "simmap" to project a single, reconstructed discrete character history. However, with nodelabels we can also/instead show posterior probabilities at nodes. So, for example:

library(devtools)
install_github("liamrevell/phytools",quiet=TRUE)
library(phytools)
## simulate bivariate data with different covariance structure
## depending on a discrete character
a<-matrix(c(1,0,0,1),2,2)
b<-matrix(c(2,1.8,1.8,2),2,2)
dimnames(a)<-dimnames(b)<-
    list(c("trait 1","trait 2"),c("trait 1","trait 2"))
V<-list(a=a,b=b)
Q<-matrix(c(-1,1,1,-1),2,2)
rownames(Q)<-colnames(Q)<-names(V)
map.tree<-sim.history(tree<-pbtree(n=40,scale=1),Q,anc="a")
## Done simulation(s).
cols<-setNames(c("blue","red"),names(V))
plot(map.tree,colors=cols,ftype="off",ylim=c(-1,Ntip(tree)))
add.simmap.legend(colors=cols,x=0.1,y=0,vertical=FALSE,prompt=FALSE)

plot of chunk unnamed-chunk-1

X<-sim.corrs(map.tree,vcv=V)
colnames(X)<-colnames(V[[1]])
phylomorphospace(map.tree,X,ftype="off",colors=cols,node.size=rep(0,2))
x<-map.tree$states
x
## t22 t23 t37 t38 t15 t31 t32 t27 t28 t26 t13 t14  t9 t20 t21  t6  t7  t8 
## "a" "a" "a" "a" "b" "a" "a" "a" "a" "a" "a" "a" "b" "a" "a" "a" "a" "b" 
## t33 t34 t19  t4  t5  t2 t16 t17  t3 t18 t35 t36 t39 t40  t1 t10 t11 t24 
## "b" "a" "b" "b" "a" "a" "a" "b" "a" "a" "a" "a" "b" "b" "a" "b" "b" "b" 
## t25 t12 t29 t30 
## "b" "b" "b" "b"
obj<-ace(x,tree,type="discrete",model="ER")
nodelabels(pie=obj$lik.anc,piecol=cols,cex=0.6)
tiplabels(pie=to.matrix(x,names(V)),piecol=cols,cex=0.4)

plot of chunk unnamed-chunk-2

## or, alternatively:
phylomorphospace(tree,X,ftype="off",node.size=rep(0,2))
nodelabels(pie=obj$lik.anc,piecol=cols,cex=0.6)
tiplabels(pie=to.matrix(x,names(V)),piecol=cols,cex=0.4)

plot of chunk unnamed-chunk-2

That's it.