## Saturday, March 30, 2013

### Estimating ancestral states when tips are uncertain

I recently received an email with the following content (among other things):

"Have you ever considered developing algorithms for inferring character states for taxa that are missing data? I looked at phytools and did not see an application that does this."

Well, first of all, at least one method in phytools, the function ancThresh for ancestral character estimation under the threshold model, does allow for uncertain tip states by permitting input of known states or a matrix of prior probabilities on tips. In the case where data were totally unknown (i.e., missing) we would just provide a completely uninformative prior probability distribution for the state.

It is relatively straightforward to use the same general idea if sampling ancestral character histories using the method of stochastic character mapping. As discussed in prior posts (e.g., 1, 2) it is easy to get the marginal probabilities from joint sampling of ancestral states using stochastic mapping. I have posted an updated version of make.simmap that can take as input either a character state vector of a matrix of prior probabilities for tip states.

Implementation of this was surprisingly easy. First, I modified the internally called function apeAce - in which I have adapted code from ape's ace to compute the scaled conditional likelihoods for internal nodes. Here, I just need to adjust the code slightly so that we use the prior probabilities for tip states during post-order tree traversal during the pruning algorithm. Next, I modified the mapping code so that tip edges are treated the same way as internal edges. In this way we can assign node and tip states stochastically via pre-order tree traversal in the first step of the mapping algorithm described in Bollback (2006). Piece of cake!

One of the interesting features of both this method & ancThresh is that it allows us to estimate the posterior probability that uncertain tips are in any state.

Here's a quick demo. This also uses an updated version of desribe.simmap. Both updated functions are in the latest minor release of phytools (phytools 0.2-34), which can be downloaded & installed from source:

> # load phytools & dependencies
> require(phytools)
> packageVersion("phytools")
[1] ‘0.2.34’
> # this function turns a discrete character into a
> # binary matrix
> to.matrix<-function(x,seq) sapply(seq,"==",x)*1
>
> # simulate tree & data
> tree<-pbtree(n=25,scale=1)
> Q<-matrix(c(-2,1,1,1,-2,1,1,1,-2),3,3)
> rownames(Q)<-colnames(Q)<-letters[1:3]
> x<-sim.history(tree,Q)\$states
> x
t9 t10  t7 t18 t19  t1 t14 t15  t3  t8 t11 t16 t17 t13 t20
"a" "c" "c" "c" "c" "a" "b" "b" "c" "a" "c" "c" "c" "c" "b"
t21  t6  t4  t5 t24 t25 t12  t2 t22 t23
"c" "a" "c" "c" "c" "c" "c" "c" "a" "a"
> # convert to binary matrix
> PP<-to.matrix(x,letters[1:3])
> PP
a b c
t9  1 0 0
t10 0 0 1
t7  0 0 1
t18 0 0 1
t19 0 0 1
t1  1 0 0
t14 0 1 0
t15 0 1 0
t3  0 0 1
t8  1 0 0
t11 0 0 1
t16 0 0 1
t17 0 0 1
t13 0 0 1
t20 0 1 0
t21 0 0 1
t6  1 0 0
t4  0 0 1
t5  0 0 1
t24 0 0 1
t25 0 0 1
t12 0 0 1
t2  0 0 1
t22 1 0 0
t23 1 0 0

OK now let's make some of our tip states (arbitrarily) uncertain:

> # t20 might be 'a' or 'b'
> PP["t20",2:3]<-c(0.5,0.5)
> # we are totally ignorant about t2 and t23
> PP["t2",]<-rep(1/3,3)
> PP["t23",]<-rep(1/3,3)
> PP
a     b     c
t9  1.000 0.000 0.000
t10 0.000 0.000 1.000
t7  0.000 0.000 1.000
t18 0.000 0.000 1.000
t19 0.000 0.000 1.000
t1  1.000 0.000 0.000
t14 0.000 1.000 0.000
t15 0.000 1.000 0.000
t3  0.000 0.000 1.000
t8  1.000 0.000 0.000
t11 0.000 0.000 1.000
t16 0.000 0.000 1.000
t17 0.000 0.000 1.000
t13 0.000 0.000 1.000
t20 0.000 0.500 0.500
t21 0.000 0.000 1.000
t6  1.000 0.000 0.000
t4  0.000 0.000 1.000
t5  0.000 0.000 1.000
t24 0.000 0.000 1.000
t25 0.000 0.000 1.000
t12 0.000 0.000 1.000
t2  0.333 0.333 0.333
t22 1.000 0.000 0.000
t23 0.333 0.333 0.333

Finally, let's do stochastic character mapping (and estimate ancestral states) using either our true data, or our data with uncertainty and missing data for some tips.

> # true data
> atrees<-make.simmap(tree,x,nsim=200,model="ER")
make.simmap is sampling character histories conditioned on the transition matrix
Q =
a          b          c
a -1.4409903  0.7204951  0.7204951
b  0.7204951 -1.4409903  0.7204951
c  0.7204951  0.7204951 -1.4409903
(estimated using likelihood);
and state frequencies (used to sample the root node following Bollback 2006)
pi =
a         b         c
0.3333333 0.3333333 0.3333333
> # data with uncertainty
> btrees<-make.simmap(tree,PP,nsim=200,model="ER")
make.simmap is sampling character histories conditioned on the transition matrix
Q =
a          b          c
a -1.1736451  0.5868225  0.5868225
b  0.5868225 -1.1736451  0.5868225
c  0.5868225  0.5868225 -1.1736451
(estimated using likelihood);
and state frequencies (used to sample the root node following Bollback 2006)
pi =
a         b         c
0.3333333 0.3333333 0.3333333
> # compute posterior probabilities at nodes & tips
> AA<-describe.simmap(atrees,plot=TRUE)
200 trees with a mapped discrete character with states:
a, b, c

trees have 14.255 changes between states on average

changes are of the following types:
a,b  a,c   b,a  b,c  c,a  c,b
x->y 1.37 3.82 1.045 1.67 4.01 2.34

mean total time spent in each state is:
a          b         c    total
raw  2.2430532 0.78932529 5.2821745 8.314553
prop 0.2697744 0.09493298 0.6352927 1.000000

> BB<-describe.simmap(btrees,plot=TRUE)
200 trees with a mapped discrete character with states:
a, b, c

trees have 11.755 changes between states on average

changes are of the following types:
a,b  a,c b,a   b,c   c,a  c,b
x->y 1.005 3.21 0.8 1.075 4.145 1.52

mean total time spent in each state is:
a          b        c    total
raw  2.1909087 0.73316975 5.390475 8.314553
prop 0.2635029 0.08817909 0.648318 1.000000

One of the neat attributes of this approach is that in addition to estimates at internal nodes, we also get posterior probabilities for tips. These are plotted in the figure above. (The necessary downside, of course, is that sometimes these estimates will be wrong, as is the case for tip t20, which we infer with high confidence is green when it should actually be red.)

That's it.