Friday, February 27, 2015

Stochastic mapping when tip states are uncertain - revisited

Stimulated by a user inquiry, here is a quick re-hash of ancestral character estimation using the phytools function make.simmap when tip states are uncertain.

Firstly, let's simulate some data with the property of uncertainty in the tip values - that is, we do not know the states for some extant taxa in the tree. The way this is expressed is as a prior probability distribution on the states for those tips that we do not know. For instance, if our character has two states "a" & "b", and we are completely ignorant of the values of certain species in the tree, then we might say that the prior probability of being in each of states "a" or "b" was 0.5.

## load phytools
library(phytools)
## simulate stochastic pure-birth tree
tree<-pbtree(n=26,tip.label=LETTERS,scale=1)
## generate character transition matrix
Q<-matrix(c(-1,1,1,-1),2,2)
rownames(Q)<-colnames(Q)<-letters[1:2]
Q
##    a  b
## a -1  1
## b  1 -1
## simulate character
x<-sim.history(tree,Q)$states
## Done simulation(s).
x
##   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R 
## "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "a" "a" "a" "b" "b" "b" "b" 
##   S   T   U   V   W   X   Y   Z 
## "b" "b" "b" "a" "a" "a" "a" "a"

Next, we have to add some uncertainty. Arbitrarily, let's say that we do not know the states of five of the twenty-six taxa in our tree. For the simulation, let's choose the five taxa with missing data at random:

## first encode our original data as a matrix
x<-to.matrix(x,seq=letters[1:2])
x
##   a b
## A 0 1
## B 0 1
## C 0 1
## D 0 1
## E 0 1
## F 0 1
## G 0 1
## H 0 1
## I 0 1
## J 0 1
## K 0 1
## L 1 0
## M 1 0
## N 1 0
## O 0 1
## P 0 1
## Q 0 1
## R 0 1
## S 0 1
## T 0 1
## U 0 1
## V 1 0
## W 1 0
## X 1 0
## Y 1 0
## Z 1 0
## now set some of these taxa to be uncertain
x[sample(1:26,5),]<-rep(0.5,2)
x
##     a   b
## A 0.0 1.0
## B 0.0 1.0
## C 0.0 1.0
## D 0.0 1.0
## E 0.0 1.0
## F 0.5 0.5
## G 0.0 1.0
## H 0.5 0.5
## I 0.0 1.0
## J 0.0 1.0
## K 0.0 1.0
## L 1.0 0.0
## M 1.0 0.0
## N 1.0 0.0
## O 0.0 1.0
## P 0.0 1.0
## Q 0.0 1.0
## R 0.5 0.5
## S 0.0 1.0
## T 0.5 0.5
## U 0.0 1.0
## V 1.0 0.0
## W 1.0 0.0
## X 0.5 0.5
## Y 1.0 0.0
## Z 1.0 0.0

Finally, let's generate some stochastic character maps using the phytools function make.simmap:

trees<-make.simmap(tree,x,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
##            a          b
## a -0.9951156  0.9951156
## b  0.9951156 -0.9951156
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##   a   b 
## 0.5 0.5
## Done.
trees
## 100 phylogenetic trees

If we want to obtain the posterior probabilities at nodes or tips, we can use the function describe.simmap:

obj<-describe.simmap(trees)
obj
## 100 trees with a mapped discrete character with states:
##  a, b 
## 
## trees have 7.6 changes between states on average
## 
## changes are of the following types:
##       a,b  b,a
## x->y 3.55 4.05
## 
## mean total time spent in each state is:
##              a         b    total
## raw  2.4524900 5.0739916 7.526482
## prop 0.3258481 0.6741519 1.000000
plot(obj)

plot of chunk unnamed-chunk-4

The matrices of posterior probabilities at nodes and tips are stored in the two matrices obj$ace and obj$tips, respectively. Check it out:

obj$ace
##       a    b
## 27 0.35 0.65
## 28 0.07 0.93
## 29 0.05 0.95
## 30 0.05 0.95
## 31 0.13 0.87
## 32 0.11 0.89
## 33 0.00 1.00
## 34 0.00 1.00
## 35 0.00 1.00
## 36 0.00 1.00
## 37 0.00 1.00
## 38 0.00 1.00
## 39 0.00 1.00
## 40 0.00 1.00
## 41 0.59 0.41
## 42 1.00 0.00
## 43 0.00 1.00
## 44 0.07 0.93
## 45 0.49 0.51
## 46 0.27 0.73
## 47 0.88 0.12
## 48 0.89 0.11
## 49 0.83 0.17
## 50 0.95 0.05
## 51 1.00 0.00
obj$tips
##      a    b
## A 0.00 1.00
## B 0.00 1.00
## C 0.00 1.00
## D 0.00 1.00
## E 0.00 1.00
## F 0.09 0.91
## G 0.00 1.00
## H 0.01 0.99
## I 0.00 1.00
## J 0.00 1.00
## K 0.00 1.00
## L 1.00 0.00
## M 1.00 0.00
## N 1.00 0.00
## O 0.00 1.00
## P 0.00 1.00
## Q 0.00 1.00
## R 0.20 0.80
## S 0.00 1.00
## T 0.42 0.58
## U 0.00 1.00
## V 1.00 0.00
## W 1.00 0.00
## X 0.86 0.14
## Y 1.00 0.00
## Z 1.00 0.00

Of course, the posterior probabilities at the tips will normally be different than the prior probabilities, as we would expect.

Another possibility, of course, for visualizing the variability across maps at nodes & tips is the phytools function densityMap.

densityMap(trees,outline=TRUE)
## sorry - this might take a while; please be patient

plot of chunk unnamed-chunk-6

Finally, if we didn't know that the function describe.simmap existed, it would not be too difficult to get these same values using the function getStates from phytools, as follows:

## first nodes
X<-getStates(trees)
levs<-sort(unique(as.vector(X)))
ace<-t(apply(X,1,function(x,l) summary(factor(x,levels=l)),l=levs))/ncol(X)
ace
##       a    b
## 27 0.35 0.65
## 28 0.07 0.93
## 29 0.05 0.95
## 30 0.05 0.95
## 31 0.13 0.87
## 32 0.11 0.89
## 33 0.00 1.00
## 34 0.00 1.00
## 35 0.00 1.00
## 36 0.00 1.00
## 37 0.00 1.00
## 38 0.00 1.00
## 39 0.00 1.00
## 40 0.00 1.00
## 41 0.59 0.41
## 42 1.00 0.00
## 43 0.00 1.00
## 44 0.07 0.93
## 45 0.49 0.51
## 46 0.27 0.73
## 47 0.88 0.12
## 48 0.89 0.11
## 49 0.83 0.17
## 50 0.95 0.05
## 51 1.00 0.00
## now tips
X<-getStates(trees,"tips")
levs<-sort(unique(as.vector(X)))
tips<-t(apply(X,1,function(x,l) summary(factor(x,levels=l)),l=levs))/ncol(X)
tips
##      a    b
## A 0.00 1.00
## B 0.00 1.00
## C 0.00 1.00
## D 0.00 1.00
## E 0.00 1.00
## F 0.09 0.91
## G 0.00 1.00
## H 0.01 0.99
## I 0.00 1.00
## J 0.00 1.00
## K 0.00 1.00
## L 1.00 0.00
## M 1.00 0.00
## N 1.00 0.00
## O 0.00 1.00
## P 0.00 1.00
## Q 0.00 1.00
## R 0.20 0.80
## S 0.00 1.00
## T 0.42 0.58
## U 0.00 1.00
## V 1.00 0.00
## W 1.00 0.00
## X 0.86 0.14
## Y 1.00 0.00
## Z 1.00 0.00

That's it.

Sunday, February 15, 2015

Perhaps the last (wishful thinking?) bug fix for reroot function for trees with node labels

I have just posted yet another bug fix for the phytools function reroot for trees containing node labels. This bug (reported by a helpful user) had the effect of causing one of the edge lengths descended from the new root to be wrong.

All are basically related to the same underlying issue which is that the ape function drop.tip(...,trim.internal=FALSE) will label the stem of a trimmed clade with the tip label "NA" when node labels are absent, but the node label itself when present. All of the internals used by reroot were originally built around the assumption that the trimmed edge would be labeled "NA". Unfortunately, this created issues for multiple internals - which I have been peeling back & fixing one-by-one.

This update is also in the latest (non-CRAN) phytools version which, as always, can be obtained from the phytools webpage.

Here is a quick demo of the bug & its fix.

First reroot behaving properly for a tree without node labels:

library(phytools)
## simulate a tree
set.seed(1)
tree<-rtree(n=26,tip.label=LETTERS)
plotTree(tree)
nodelabels()

plot of chunk unnamed-chunk-1

## here with a tree without node labels
t1<-reroot(tree,31,0.5*tree$edge.length[which(tree$edge[,2]==31)])
plotTree(t1)

plot of chunk unnamed-chunk-1

## we re-rooting in the middle of an edge so
## these should be equal
t1$edge.length[which(t1$edge[,1]==(Ntip(t1)+1))]
## [1] 0.3661569 0.3661569

OK, now here it is misbehaving. The tree is re-rooted in the right spot, but the edge lengths descended from the root are wrong:

## add node labels
tree$node.label<-paste("n",1:tree$Nnode+Ntip(tree),sep="")
plotTree(tree)
nodelabels(tree$node.label)

plot of chunk unnamed-chunk-2

t2<-reroot(tree,31,0.5*tree$edge.length[which(tree$edge[,2]==31)])
plotTree(t2) ## look at the edge lengths
nodelabels(t2$node.label)

plot of chunk unnamed-chunk-2

## these should be equal, but they are not
t2$edge.length[which(t1$edge[,1]==(Ntip(t2)+1))]
## [1] 2.2862548 0.3661569

OK, now let's load the fix and try again:

detach("package:phytools", unload=TRUE)
install.packages("phytools_0.4-49.tar.gz",type="source")
## Installing package into 'C:/Users/Liam/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## inferring 'repos = NULL' from 'pkgs'
library(phytools)
t2<-reroot(tree,31,0.5*tree$edge.length[which(tree$edge[,2]==31)])
plotTree(t2)
nodelabels(t2$node.label)

plot of chunk unnamed-chunk-3

## these should be equal
t2$edge.length[which(t1$edge[,1]==(Ntip(t2)+1))]
## [1] 0.3661569 0.3661569

That's it.

Wednesday, February 11, 2015

Yet another bug fix for reroot for an input tree containing node labels

Recently, a phytools user correctly reported a bug in the phytools function reroot to re-root a tree along an edge when the input tree contained node labels. I thought I had identified it's cause and fixed it (details here). Specifically, the problem is caused by the fact that the function drop.tip(...,trim.internal=FALSE) will leave behind "NA" as a tip label for the trimmed clade when node labels are absent; however it will instead use the node labels if they are present. I fixed the (primarily internal) phytools function splitTree for this issue; however I failed to realize that I also needed to update the function drop.clade. I have now done this and posted the updated code here.

Let's try it with the old version & the new one:

library(phytools)
set.seed(1) ## set seed for reproducibility
tree<-rtree(n=26)
tree$tip.label<-LETTERS
plotTree(tree)

plot of chunk unnamed-chunk-1

node<-37
position=0.5*tree$edge.length[which(tree$edge[,2]==node)]
plotTree(reroot(tree,node,position)) ## works fine

plot of chunk unnamed-chunk-1

## now try with node labels
tree$node.label<-paste("n",1:tree$Nnode,sep="")
rerooted.tree<-reroot(tree,node,position) ## breaks
## Error in if (newroot == ROOT) {: argument is of length zero

Now let's update to the latest phytools version (not on CRAN) and re-attempt:

detach("package:phytools",unload=TRUE)
install.packages("phytools_0.4-48.tar.gz",type="source",repos=NULL)
## Installing package into 'C:/Users/Liam/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
library(phytools)
rerooted.tree<-reroot(tree,node,position)
plotTree(rerooted.tree)
nodelabels(rerooted.tree$node.label)

plot of chunk unnamed-chunk-2

The most recent working version of phytools can always be downloaded here.

That's it.