Friday, February 27, 2015

Stochastic mapping when tip states are uncertain - revisited

Stimulated by a user inquiry, here is a quick re-hash of ancestral character estimation using the phytools function `make.simmap` when tip states are uncertain.

Firstly, let's simulate some data with the property of uncertainty in the tip values - that is, we do not know the states for some extant taxa in the tree. The way this is expressed is as a prior probability distribution on the states for those tips that we do not know. For instance, if our character has two states `"a"` & `"b"`, and we are completely ignorant of the values of certain species in the tree, then we might say that the prior probability of being in each of states `"a"` or `"b"` was `0.5`.

``````## load phytools
library(phytools)
## simulate stochastic pure-birth tree
tree<-pbtree(n=26,tip.label=LETTERS,scale=1)
## generate character transition matrix
Q<-matrix(c(-1,1,1,-1),2,2)
rownames(Q)<-colnames(Q)<-letters[1:2]
Q
``````
``````##    a  b
## a -1  1
## b  1 -1
``````
``````## simulate character
x<-sim.history(tree,Q)\$states
``````
``````## Done simulation(s).
``````
``````x
``````
``````##   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R
## "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "a" "a" "a" "b" "b" "b" "b"
##   S   T   U   V   W   X   Y   Z
## "b" "b" "b" "a" "a" "a" "a" "a"
``````

Next, we have to add some uncertainty. Arbitrarily, let's say that we do not know the states of five of the twenty-six taxa in our tree. For the simulation, let's choose the five taxa with missing data at random:

``````## first encode our original data as a matrix
x<-to.matrix(x,seq=letters[1:2])
x
``````
``````##   a b
## A 0 1
## B 0 1
## C 0 1
## D 0 1
## E 0 1
## F 0 1
## G 0 1
## H 0 1
## I 0 1
## J 0 1
## K 0 1
## L 1 0
## M 1 0
## N 1 0
## O 0 1
## P 0 1
## Q 0 1
## R 0 1
## S 0 1
## T 0 1
## U 0 1
## V 1 0
## W 1 0
## X 1 0
## Y 1 0
## Z 1 0
``````
``````## now set some of these taxa to be uncertain
x[sample(1:26,5),]<-rep(0.5,2)
x
``````
``````##     a   b
## A 0.0 1.0
## B 0.0 1.0
## C 0.0 1.0
## D 0.0 1.0
## E 0.0 1.0
## F 0.5 0.5
## G 0.0 1.0
## H 0.5 0.5
## I 0.0 1.0
## J 0.0 1.0
## K 0.0 1.0
## L 1.0 0.0
## M 1.0 0.0
## N 1.0 0.0
## O 0.0 1.0
## P 0.0 1.0
## Q 0.0 1.0
## R 0.5 0.5
## S 0.0 1.0
## T 0.5 0.5
## U 0.0 1.0
## V 1.0 0.0
## W 1.0 0.0
## X 0.5 0.5
## Y 1.0 0.0
## Z 1.0 0.0
``````

Finally, let's generate some stochastic character maps using the phytools function `make.simmap`:

``````trees<-make.simmap(tree,x,nsim=100)
``````
``````## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
``````
``````##            a          b
## a -0.9951156  0.9951156
## b  0.9951156 -0.9951156
``````
``````## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
``````
``````##   a   b
## 0.5 0.5
``````
``````## Done.
``````
``````trees
``````
``````## 100 phylogenetic trees
``````

If we want to obtain the posterior probabilities at nodes or tips, we can use the function `describe.simmap`:

``````obj<-describe.simmap(trees)
obj
``````
``````## 100 trees with a mapped discrete character with states:
##  a, b
##
## trees have 7.6 changes between states on average
##
## changes are of the following types:
##       a,b  b,a
## x->y 3.55 4.05
##
## mean total time spent in each state is:
##              a         b    total
## raw  2.4524900 5.0739916 7.526482
## prop 0.3258481 0.6741519 1.000000
``````
``````plot(obj)
``````

The matrices of posterior probabilities at nodes and tips are stored in the two matrices `obj\$ace` and `obj\$tips`, respectively. Check it out:

``````obj\$ace
``````
``````##       a    b
## 27 0.35 0.65
## 28 0.07 0.93
## 29 0.05 0.95
## 30 0.05 0.95
## 31 0.13 0.87
## 32 0.11 0.89
## 33 0.00 1.00
## 34 0.00 1.00
## 35 0.00 1.00
## 36 0.00 1.00
## 37 0.00 1.00
## 38 0.00 1.00
## 39 0.00 1.00
## 40 0.00 1.00
## 41 0.59 0.41
## 42 1.00 0.00
## 43 0.00 1.00
## 44 0.07 0.93
## 45 0.49 0.51
## 46 0.27 0.73
## 47 0.88 0.12
## 48 0.89 0.11
## 49 0.83 0.17
## 50 0.95 0.05
## 51 1.00 0.00
``````
``````obj\$tips
``````
``````##      a    b
## A 0.00 1.00
## B 0.00 1.00
## C 0.00 1.00
## D 0.00 1.00
## E 0.00 1.00
## F 0.09 0.91
## G 0.00 1.00
## H 0.01 0.99
## I 0.00 1.00
## J 0.00 1.00
## K 0.00 1.00
## L 1.00 0.00
## M 1.00 0.00
## N 1.00 0.00
## O 0.00 1.00
## P 0.00 1.00
## Q 0.00 1.00
## R 0.20 0.80
## S 0.00 1.00
## T 0.42 0.58
## U 0.00 1.00
## V 1.00 0.00
## W 1.00 0.00
## X 0.86 0.14
## Y 1.00 0.00
## Z 1.00 0.00
``````

Of course, the posterior probabilities at the tips will normally be different than the prior probabilities, as we would expect.

Another possibility, of course, for visualizing the variability across maps at nodes & tips is the phytools function `densityMap`.

``````densityMap(trees,outline=TRUE)
``````
``````## sorry - this might take a while; please be patient
``````

Finally, if we didn't know that the function `describe.simmap` existed, it would not be too difficult to get these same values using the function `getStates` from phytools, as follows:

``````## first nodes
X<-getStates(trees)
levs<-sort(unique(as.vector(X)))
ace<-t(apply(X,1,function(x,l) summary(factor(x,levels=l)),l=levs))/ncol(X)
ace
``````
``````##       a    b
## 27 0.35 0.65
## 28 0.07 0.93
## 29 0.05 0.95
## 30 0.05 0.95
## 31 0.13 0.87
## 32 0.11 0.89
## 33 0.00 1.00
## 34 0.00 1.00
## 35 0.00 1.00
## 36 0.00 1.00
## 37 0.00 1.00
## 38 0.00 1.00
## 39 0.00 1.00
## 40 0.00 1.00
## 41 0.59 0.41
## 42 1.00 0.00
## 43 0.00 1.00
## 44 0.07 0.93
## 45 0.49 0.51
## 46 0.27 0.73
## 47 0.88 0.12
## 48 0.89 0.11
## 49 0.83 0.17
## 50 0.95 0.05
## 51 1.00 0.00
``````
``````## now tips
X<-getStates(trees,"tips")
levs<-sort(unique(as.vector(X)))
tips<-t(apply(X,1,function(x,l) summary(factor(x,levels=l)),l=levs))/ncol(X)
tips
``````
``````##      a    b
## A 0.00 1.00
## B 0.00 1.00
## C 0.00 1.00
## D 0.00 1.00
## E 0.00 1.00
## F 0.09 0.91
## G 0.00 1.00
## H 0.01 0.99
## I 0.00 1.00
## J 0.00 1.00
## K 0.00 1.00
## L 1.00 0.00
## M 1.00 0.00
## N 1.00 0.00
## O 0.00 1.00
## P 0.00 1.00
## Q 0.00 1.00
## R 0.20 0.80
## S 0.00 1.00
## T 0.42 0.58
## U 0.00 1.00
## V 1.00 0.00
## W 1.00 0.00
## X 0.86 0.14
## Y 1.00 0.00
## Z 1.00 0.00
``````

That's it.

Sunday, February 15, 2015

Perhaps the last (wishful thinking?) bug fix for reroot function for trees with node labels

I have just posted yet another bug fix for the phytools function `reroot` for trees containing node labels. This bug (reported by a helpful user) had the effect of causing one of the edge lengths descended from the new root to be wrong.

All are basically related to the same underlying issue which is that the ape function `drop.tip(...,trim.internal=FALSE)` will label the stem of a trimmed clade with the tip label `"NA"` when node labels are absent, but the node label itself when present. All of the internals used by `reroot` were originally built around the assumption that the trimmed edge would be labeled `"NA"`. Unfortunately, this created issues for multiple internals - which I have been peeling back & fixing one-by-one.

This update is also in the latest (non-CRAN) phytools version which, as always, can be obtained from the phytools webpage.

Here is a quick demo of the bug & its fix.

First `reroot` behaving properly for a tree without node labels:

``````library(phytools)
## simulate a tree
set.seed(1)
tree<-rtree(n=26,tip.label=LETTERS)
plotTree(tree)
nodelabels()
``````

``````## here with a tree without node labels
t1<-reroot(tree,31,0.5*tree\$edge.length[which(tree\$edge[,2]==31)])
plotTree(t1)
``````

``````## we re-rooting in the middle of an edge so
## these should be equal
t1\$edge.length[which(t1\$edge[,1]==(Ntip(t1)+1))]
``````
``````## [1] 0.3661569 0.3661569
``````

OK, now here it is misbehaving. The tree is re-rooted in the right spot, but the edge lengths descended from the root are wrong:

``````## add node labels
tree\$node.label<-paste("n",1:tree\$Nnode+Ntip(tree),sep="")
plotTree(tree)
nodelabels(tree\$node.label)
``````

``````t2<-reroot(tree,31,0.5*tree\$edge.length[which(tree\$edge[,2]==31)])
plotTree(t2) ## look at the edge lengths
nodelabels(t2\$node.label)
``````

``````## these should be equal, but they are not
t2\$edge.length[which(t1\$edge[,1]==(Ntip(t2)+1))]
``````
``````## [1] 2.2862548 0.3661569
``````

OK, now let's load the fix and try again:

``````detach("package:phytools", unload=TRUE)
install.packages("phytools_0.4-49.tar.gz",type="source")
``````
``````## Installing package into 'C:/Users/Liam/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## inferring 'repos = NULL' from 'pkgs'
``````
``````library(phytools)
t2<-reroot(tree,31,0.5*tree\$edge.length[which(tree\$edge[,2]==31)])
plotTree(t2)
nodelabels(t2\$node.label)
``````

``````## these should be equal
t2\$edge.length[which(t1\$edge[,1]==(Ntip(t2)+1))]
``````
``````## [1] 0.3661569 0.3661569
``````

That's it.

Wednesday, February 11, 2015

Yet another bug fix for reroot for an input tree containing node labels

Recently, a phytools user correctly reported a bug in the phytools function `reroot` to re-root a tree along an edge when the input tree contained node labels. I thought I had identified it's cause and fixed it (details here). Specifically, the problem is caused by the fact that the function `drop.tip(...,trim.internal=FALSE)` will leave behind `"NA"` as a tip label for the trimmed clade when node labels are absent; however it will instead use the node labels if they are present. I fixed the (primarily internal) phytools function `splitTree` for this issue; however I failed to realize that I also needed to update the function `drop.clade`. I have now done this and posted the updated code here.

Let's try it with the old version & the new one:

``````library(phytools)
set.seed(1) ## set seed for reproducibility
tree<-rtree(n=26)
tree\$tip.label<-LETTERS
plotTree(tree)
``````

``````node<-37
position=0.5*tree\$edge.length[which(tree\$edge[,2]==node)]
plotTree(reroot(tree,node,position)) ## works fine
``````

``````## now try with node labels
tree\$node.label<-paste("n",1:tree\$Nnode,sep="")
rerooted.tree<-reroot(tree,node,position) ## breaks
``````
``````## Error in if (newroot == ROOT) {: argument is of length zero
``````

Now let's update to the latest phytools version (not on CRAN) and re-attempt:

``````detach("package:phytools",unload=TRUE)
install.packages("phytools_0.4-48.tar.gz",type="source",repos=NULL)
``````
``````## Installing package into 'C:/Users/Liam/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
``````
``````library(phytools)
rerooted.tree<-reroot(tree,node,position)
plotTree(rerooted.tree)
nodelabels(rerooted.tree\$node.label)
``````

The most recent working version of phytools can always be downloaded here.

That's it.