## Friday, February 27, 2015

### Stochastic mapping when tip states are uncertain - revisited

Stimulated by a user inquiry, here is a quick re-hash of ancestral character estimation using the phytools function `make.simmap` when tip states are uncertain.

Firstly, let's simulate some data with the property of uncertainty in the tip values - that is, we do not know the states for some extant taxa in the tree. The way this is expressed is as a prior probability distribution on the states for those tips that we do not know. For instance, if our character has two states `"a"` & `"b"`, and we are completely ignorant of the values of certain species in the tree, then we might say that the prior probability of being in each of states `"a"` or `"b"` was `0.5`.

``````## load phytools
library(phytools)
## simulate stochastic pure-birth tree
tree<-pbtree(n=26,tip.label=LETTERS,scale=1)
## generate character transition matrix
Q<-matrix(c(-1,1,1,-1),2,2)
rownames(Q)<-colnames(Q)<-letters[1:2]
Q
``````
``````##    a  b
## a -1  1
## b  1 -1
``````
``````## simulate character
x<-sim.history(tree,Q)\$states
``````
``````## Done simulation(s).
``````
``````x
``````
``````##   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R
## "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "a" "a" "a" "b" "b" "b" "b"
##   S   T   U   V   W   X   Y   Z
## "b" "b" "b" "a" "a" "a" "a" "a"
``````

Next, we have to add some uncertainty. Arbitrarily, let's say that we do not know the states of five of the twenty-six taxa in our tree. For the simulation, let's choose the five taxa with missing data at random:

``````## first encode our original data as a matrix
x<-to.matrix(x,seq=letters[1:2])
x
``````
``````##   a b
## A 0 1
## B 0 1
## C 0 1
## D 0 1
## E 0 1
## F 0 1
## G 0 1
## H 0 1
## I 0 1
## J 0 1
## K 0 1
## L 1 0
## M 1 0
## N 1 0
## O 0 1
## P 0 1
## Q 0 1
## R 0 1
## S 0 1
## T 0 1
## U 0 1
## V 1 0
## W 1 0
## X 1 0
## Y 1 0
## Z 1 0
``````
``````## now set some of these taxa to be uncertain
x[sample(1:26,5),]<-rep(0.5,2)
x
``````
``````##     a   b
## A 0.0 1.0
## B 0.0 1.0
## C 0.0 1.0
## D 0.0 1.0
## E 0.0 1.0
## F 0.5 0.5
## G 0.0 1.0
## H 0.5 0.5
## I 0.0 1.0
## J 0.0 1.0
## K 0.0 1.0
## L 1.0 0.0
## M 1.0 0.0
## N 1.0 0.0
## O 0.0 1.0
## P 0.0 1.0
## Q 0.0 1.0
## R 0.5 0.5
## S 0.0 1.0
## T 0.5 0.5
## U 0.0 1.0
## V 1.0 0.0
## W 1.0 0.0
## X 0.5 0.5
## Y 1.0 0.0
## Z 1.0 0.0
``````

Finally, let's generate some stochastic character maps using the phytools function `make.simmap`:

``````trees<-make.simmap(tree,x,nsim=100)
``````
``````## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
``````
``````##            a          b
## a -0.9951156  0.9951156
## b  0.9951156 -0.9951156
``````
``````## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
``````
``````##   a   b
## 0.5 0.5
``````
``````## Done.
``````
``````trees
``````
``````## 100 phylogenetic trees
``````

If we want to obtain the posterior probabilities at nodes or tips, we can use the function `describe.simmap`:

``````obj<-describe.simmap(trees)
obj
``````
``````## 100 trees with a mapped discrete character with states:
##  a, b
##
## trees have 7.6 changes between states on average
##
## changes are of the following types:
##       a,b  b,a
## x->y 3.55 4.05
##
## mean total time spent in each state is:
##              a         b    total
## raw  2.4524900 5.0739916 7.526482
## prop 0.3258481 0.6741519 1.000000
``````
``````plot(obj)
``````

The matrices of posterior probabilities at nodes and tips are stored in the two matrices `obj\$ace` and `obj\$tips`, respectively. Check it out:

``````obj\$ace
``````
``````##       a    b
## 27 0.35 0.65
## 28 0.07 0.93
## 29 0.05 0.95
## 30 0.05 0.95
## 31 0.13 0.87
## 32 0.11 0.89
## 33 0.00 1.00
## 34 0.00 1.00
## 35 0.00 1.00
## 36 0.00 1.00
## 37 0.00 1.00
## 38 0.00 1.00
## 39 0.00 1.00
## 40 0.00 1.00
## 41 0.59 0.41
## 42 1.00 0.00
## 43 0.00 1.00
## 44 0.07 0.93
## 45 0.49 0.51
## 46 0.27 0.73
## 47 0.88 0.12
## 48 0.89 0.11
## 49 0.83 0.17
## 50 0.95 0.05
## 51 1.00 0.00
``````
``````obj\$tips
``````
``````##      a    b
## A 0.00 1.00
## B 0.00 1.00
## C 0.00 1.00
## D 0.00 1.00
## E 0.00 1.00
## F 0.09 0.91
## G 0.00 1.00
## H 0.01 0.99
## I 0.00 1.00
## J 0.00 1.00
## K 0.00 1.00
## L 1.00 0.00
## M 1.00 0.00
## N 1.00 0.00
## O 0.00 1.00
## P 0.00 1.00
## Q 0.00 1.00
## R 0.20 0.80
## S 0.00 1.00
## T 0.42 0.58
## U 0.00 1.00
## V 1.00 0.00
## W 1.00 0.00
## X 0.86 0.14
## Y 1.00 0.00
## Z 1.00 0.00
``````

Of course, the posterior probabilities at the tips will normally be different than the prior probabilities, as we would expect.

Another possibility, of course, for visualizing the variability across maps at nodes & tips is the phytools function `densityMap`.

``````densityMap(trees,outline=TRUE)
``````
``````## sorry - this might take a while; please be patient
``````

Finally, if we didn't know that the function `describe.simmap` existed, it would not be too difficult to get these same values using the function `getStates` from phytools, as follows:

``````## first nodes
X<-getStates(trees)
levs<-sort(unique(as.vector(X)))
ace<-t(apply(X,1,function(x,l) summary(factor(x,levels=l)),l=levs))/ncol(X)
ace
``````
``````##       a    b
## 27 0.35 0.65
## 28 0.07 0.93
## 29 0.05 0.95
## 30 0.05 0.95
## 31 0.13 0.87
## 32 0.11 0.89
## 33 0.00 1.00
## 34 0.00 1.00
## 35 0.00 1.00
## 36 0.00 1.00
## 37 0.00 1.00
## 38 0.00 1.00
## 39 0.00 1.00
## 40 0.00 1.00
## 41 0.59 0.41
## 42 1.00 0.00
## 43 0.00 1.00
## 44 0.07 0.93
## 45 0.49 0.51
## 46 0.27 0.73
## 47 0.88 0.12
## 48 0.89 0.11
## 49 0.83 0.17
## 50 0.95 0.05
## 51 1.00 0.00
``````
``````## now tips
X<-getStates(trees,"tips")
levs<-sort(unique(as.vector(X)))
tips<-t(apply(X,1,function(x,l) summary(factor(x,levels=l)),l=levs))/ncol(X)
tips
``````
``````##      a    b
## A 0.00 1.00
## B 0.00 1.00
## C 0.00 1.00
## D 0.00 1.00
## E 0.00 1.00
## F 0.09 0.91
## G 0.00 1.00
## H 0.01 0.99
## I 0.00 1.00
## J 0.00 1.00
## K 0.00 1.00
## L 1.00 0.00
## M 1.00 0.00
## N 1.00 0.00
## O 0.00 1.00
## P 0.00 1.00
## Q 0.00 1.00
## R 0.20 0.80
## S 0.00 1.00
## T 0.42 0.58
## U 0.00 1.00
## V 1.00 0.00
## W 1.00 0.00
## X 0.86 0.14
## Y 1.00 0.00
## Z 1.00 0.00
``````

That's it.

#### 3 comments:

1. Hey Liam,

Completely off-topic!

Maybe you've posted about this, but how you make such nice posts with R code blocks in Blogger? I've been trying to do this lately with rmarkdown in Rstudio but I can't quite figure it out...

-Dave

1. Hi Dave.

I write the posts in R markdown & then I build the .html using knit2html in the knitr package. Then I simply open the .html file & copy and paste the HTML code into the blogger window. The images are embedded, so there is no need to link .png files or anything. Incredible, right?

I have an example of this workflow in the Dryad entry for my new paper (Revell et al. In press) here. I posted the .Rmd and the .html output.

All the best, Liam

2. Interesting; I got weird formatting bugs doing that with Rstudio's knit to HTML, I'll try knit2html and see if it makes a difference. Thanks for the advice, Liam!
-Dave

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.