Friday, February 27, 2015

Stochastic mapping when tip states are uncertain - revisited

Stimulated by a user inquiry, here is a quick re-hash of ancestral character estimation using the phytools function make.simmap when tip states are uncertain.

Firstly, let's simulate some data with the property of uncertainty in the tip values - that is, we do not know the states for some extant taxa in the tree. The way this is expressed is as a prior probability distribution on the states for those tips that we do not know. For instance, if our character has two states "a" & "b", and we are completely ignorant of the values of certain species in the tree, then we might say that the prior probability of being in each of states "a" or "b" was 0.5.

## load phytools
library(phytools)
## simulate stochastic pure-birth tree
tree<-pbtree(n=26,tip.label=LETTERS,scale=1)
## generate character transition matrix
Q<-matrix(c(-1,1,1,-1),2,2)
rownames(Q)<-colnames(Q)<-letters[1:2]
Q
##    a  b
## a -1  1
## b  1 -1
## simulate character
x<-sim.history(tree,Q)$states
## Done simulation(s).
x
##   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R 
## "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "a" "a" "a" "b" "b" "b" "b" 
##   S   T   U   V   W   X   Y   Z 
## "b" "b" "b" "a" "a" "a" "a" "a"

Next, we have to add some uncertainty. Arbitrarily, let's say that we do not know the states of five of the twenty-six taxa in our tree. For the simulation, let's choose the five taxa with missing data at random:

## first encode our original data as a matrix
x<-to.matrix(x,seq=letters[1:2])
x
##   a b
## A 0 1
## B 0 1
## C 0 1
## D 0 1
## E 0 1
## F 0 1
## G 0 1
## H 0 1
## I 0 1
## J 0 1
## K 0 1
## L 1 0
## M 1 0
## N 1 0
## O 0 1
## P 0 1
## Q 0 1
## R 0 1
## S 0 1
## T 0 1
## U 0 1
## V 1 0
## W 1 0
## X 1 0
## Y 1 0
## Z 1 0
## now set some of these taxa to be uncertain
x[sample(1:26,5),]<-rep(0.5,2)
x
##     a   b
## A 0.0 1.0
## B 0.0 1.0
## C 0.0 1.0
## D 0.0 1.0
## E 0.0 1.0
## F 0.5 0.5
## G 0.0 1.0
## H 0.5 0.5
## I 0.0 1.0
## J 0.0 1.0
## K 0.0 1.0
## L 1.0 0.0
## M 1.0 0.0
## N 1.0 0.0
## O 0.0 1.0
## P 0.0 1.0
## Q 0.0 1.0
## R 0.5 0.5
## S 0.0 1.0
## T 0.5 0.5
## U 0.0 1.0
## V 1.0 0.0
## W 1.0 0.0
## X 0.5 0.5
## Y 1.0 0.0
## Z 1.0 0.0

Finally, let's generate some stochastic character maps using the phytools function make.simmap:

trees<-make.simmap(tree,x,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
##            a          b
## a -0.9951156  0.9951156
## b  0.9951156 -0.9951156
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##   a   b 
## 0.5 0.5
## Done.
trees
## 100 phylogenetic trees

If we want to obtain the posterior probabilities at nodes or tips, we can use the function describe.simmap:

obj<-describe.simmap(trees)
obj
## 100 trees with a mapped discrete character with states:
##  a, b 
## 
## trees have 7.6 changes between states on average
## 
## changes are of the following types:
##       a,b  b,a
## x->y 3.55 4.05
## 
## mean total time spent in each state is:
##              a         b    total
## raw  2.4524900 5.0739916 7.526482
## prop 0.3258481 0.6741519 1.000000
plot(obj)

plot of chunk unnamed-chunk-4

The matrices of posterior probabilities at nodes and tips are stored in the two matrices obj$ace and obj$tips, respectively. Check it out:

obj$ace
##       a    b
## 27 0.35 0.65
## 28 0.07 0.93
## 29 0.05 0.95
## 30 0.05 0.95
## 31 0.13 0.87
## 32 0.11 0.89
## 33 0.00 1.00
## 34 0.00 1.00
## 35 0.00 1.00
## 36 0.00 1.00
## 37 0.00 1.00
## 38 0.00 1.00
## 39 0.00 1.00
## 40 0.00 1.00
## 41 0.59 0.41
## 42 1.00 0.00
## 43 0.00 1.00
## 44 0.07 0.93
## 45 0.49 0.51
## 46 0.27 0.73
## 47 0.88 0.12
## 48 0.89 0.11
## 49 0.83 0.17
## 50 0.95 0.05
## 51 1.00 0.00
obj$tips
##      a    b
## A 0.00 1.00
## B 0.00 1.00
## C 0.00 1.00
## D 0.00 1.00
## E 0.00 1.00
## F 0.09 0.91
## G 0.00 1.00
## H 0.01 0.99
## I 0.00 1.00
## J 0.00 1.00
## K 0.00 1.00
## L 1.00 0.00
## M 1.00 0.00
## N 1.00 0.00
## O 0.00 1.00
## P 0.00 1.00
## Q 0.00 1.00
## R 0.20 0.80
## S 0.00 1.00
## T 0.42 0.58
## U 0.00 1.00
## V 1.00 0.00
## W 1.00 0.00
## X 0.86 0.14
## Y 1.00 0.00
## Z 1.00 0.00

Of course, the posterior probabilities at the tips will normally be different than the prior probabilities, as we would expect.

Another possibility, of course, for visualizing the variability across maps at nodes & tips is the phytools function densityMap.

densityMap(trees,outline=TRUE)
## sorry - this might take a while; please be patient

plot of chunk unnamed-chunk-6

Finally, if we didn't know that the function describe.simmap existed, it would not be too difficult to get these same values using the function getStates from phytools, as follows:

## first nodes
X<-getStates(trees)
levs<-sort(unique(as.vector(X)))
ace<-t(apply(X,1,function(x,l) summary(factor(x,levels=l)),l=levs))/ncol(X)
ace
##       a    b
## 27 0.35 0.65
## 28 0.07 0.93
## 29 0.05 0.95
## 30 0.05 0.95
## 31 0.13 0.87
## 32 0.11 0.89
## 33 0.00 1.00
## 34 0.00 1.00
## 35 0.00 1.00
## 36 0.00 1.00
## 37 0.00 1.00
## 38 0.00 1.00
## 39 0.00 1.00
## 40 0.00 1.00
## 41 0.59 0.41
## 42 1.00 0.00
## 43 0.00 1.00
## 44 0.07 0.93
## 45 0.49 0.51
## 46 0.27 0.73
## 47 0.88 0.12
## 48 0.89 0.11
## 49 0.83 0.17
## 50 0.95 0.05
## 51 1.00 0.00
## now tips
X<-getStates(trees,"tips")
levs<-sort(unique(as.vector(X)))
tips<-t(apply(X,1,function(x,l) summary(factor(x,levels=l)),l=levs))/ncol(X)
tips
##      a    b
## A 0.00 1.00
## B 0.00 1.00
## C 0.00 1.00
## D 0.00 1.00
## E 0.00 1.00
## F 0.09 0.91
## G 0.00 1.00
## H 0.01 0.99
## I 0.00 1.00
## J 0.00 1.00
## K 0.00 1.00
## L 1.00 0.00
## M 1.00 0.00
## N 1.00 0.00
## O 0.00 1.00
## P 0.00 1.00
## Q 0.00 1.00
## R 0.20 0.80
## S 0.00 1.00
## T 0.42 0.58
## U 0.00 1.00
## V 1.00 0.00
## W 1.00 0.00
## X 0.86 0.14
## Y 1.00 0.00
## Z 1.00 0.00

That's it.

3 comments:

  1. Hey Liam,

    Completely off-topic!

    Maybe you've posted about this, but how you make such nice posts with R code blocks in Blogger? I've been trying to do this lately with rmarkdown in Rstudio but I can't quite figure it out...

    -Dave

    ReplyDelete
    Replies
    1. Hi Dave.

      I write the posts in R markdown & then I build the .html using knit2html in the knitr package. Then I simply open the .html file & copy and paste the HTML code into the blogger window. The images are embedded, so there is no need to link .png files or anything. Incredible, right?

      I have an example of this workflow in the Dryad entry for my new paper (Revell et al. In press) here. I posted the .Rmd and the .html output.

      All the best, Liam

      Delete
    2. Interesting; I got weird formatting bugs doing that with Rstudio's knit to HTML, I'll try knit2html and see if it makes a difference. Thanks for the advice, Liam!
      -Dave

      Delete