I recently received an email with the following content (among other things):

*"Have you ever considered developing algorithms for inferring character states for taxa that are missing data? I looked at phytools and did not see an application that does this."*

Well, first of all, at least one method in phytools, the function ancThresh for ancestral character estimation under the threshold model, does allow for uncertain tip states by permitting input of *known* states or a matrix of *prior probabilities* on tips. In the case where data were totally unknown (i.e., missing) we would just provide a completely uninformative prior probability distribution for the state.

It is relatively straightforward to use the same general idea if sampling ancestral character histories using the method of stochastic character mapping. As discussed in prior posts (e.g., 1, 2) it is easy to get the marginal probabilities from joint sampling of ancestral states using stochastic mapping. I have posted an updated version of make.simmap that can take as input either a character state vector of a matrix of prior probabilities for tip states.

Implementation of this was surprisingly easy. First, I modified the internally called function apeAce - in which I have adapted code from ape's ace to compute the scaled conditional likelihoods for internal nodes. Here, I just need to adjust the code slightly so that we use the prior probabilities for tip states during post-order tree traversal during the pruning algorithm. Next, I modified the mapping code so that tip edges are treated the same way as internal edges. In this way we can assign node *and* tip states stochastically via pre-order tree traversal in the first step of the mapping algorithm described in Bollback (2006). Piece of cake!

One of the interesting features of both this method & ancThresh is that it allows us to estimate the posterior probability that uncertain tips are in any state.

Here's a quick demo. This also uses an updated version of desribe.simmap. Both updated functions are in the latest minor release of phytools (phytools 0.2-34), which can be downloaded & installed from source:

> require(phytools)

> packageVersion("phytools")

[1] ‘0.2.34’

> # this function turns a discrete character into a

> # binary matrix

> to.matrix<-function(x,seq) sapply(seq,"==",x)*1

>

> # simulate tree & data

> tree<-pbtree(n=25,scale=1)

> Q<-matrix(c(-2,1,1,1,-2,1,1,1,-2),3,3)

> rownames(Q)<-colnames(Q)<-letters[1:3]

> x<-sim.history(tree,Q)$states

> x

t9 t10 t7 t18 t19 t1 t14 t15 t3 t8 t11 t16 t17 t13 t20

"a" "c" "c" "c" "c" "a" "b" "b" "c" "a" "c" "c" "c" "c" "b"

t21 t6 t4 t5 t24 t25 t12 t2 t22 t23

"c" "a" "c" "c" "c" "c" "c" "c" "a" "a"

> # convert to binary matrix

> PP<-to.matrix(x,letters[1:3])

> PP

a b c

t9 1 0 0

t10 0 0 1

t7 0 0 1

t18 0 0 1

t19 0 0 1

t1 1 0 0

t14 0 1 0

t15 0 1 0

t3 0 0 1

t8 1 0 0

t11 0 0 1

t16 0 0 1

t17 0 0 1

t13 0 0 1

t20 0 1 0

t21 0 0 1

t6 1 0 0

t4 0 0 1

t5 0 0 1

t24 0 0 1

t25 0 0 1

t12 0 0 1

t2 0 0 1

t22 1 0 0

t23 1 0 0

OK now let's make some of our tip states (arbitrarily) uncertain:

> PP["t20",2:3]<-c(0.5,0.5)

> # we are totally ignorant about t2 and t23

> PP["t2",]<-rep(1/3,3)

> PP["t23",]<-rep(1/3,3)

> PP

a b c

t9 1.000 0.000 0.000

t10 0.000 0.000 1.000

t7 0.000 0.000 1.000

t18 0.000 0.000 1.000

t19 0.000 0.000 1.000

t1 1.000 0.000 0.000

t14 0.000 1.000 0.000

t15 0.000 1.000 0.000

t3 0.000 0.000 1.000

t8 1.000 0.000 0.000

t11 0.000 0.000 1.000

t16 0.000 0.000 1.000

t17 0.000 0.000 1.000

t13 0.000 0.000 1.000

t20 0.000 0.500 0.500

t21 0.000 0.000 1.000

t6 1.000 0.000 0.000

t4 0.000 0.000 1.000

t5 0.000 0.000 1.000

t24 0.000 0.000 1.000

t25 0.000 0.000 1.000

t12 0.000 0.000 1.000

t2 0.333 0.333 0.333

t22 1.000 0.000 0.000

t23 0.333 0.333 0.333

Finally, let's do stochastic character mapping (and estimate ancestral states) using either our true data, or our data with uncertainty and missing data for some tips.

> atrees<-make.simmap(tree,x,nsim=200,model="ER")

make.simmap is sampling character histories conditioned on the transition matrix

Q =

a b c

a -1.4409903 0.7204951 0.7204951

b 0.7204951 -1.4409903 0.7204951

c 0.7204951 0.7204951 -1.4409903

(estimated using likelihood);

and state frequencies (used to sample the root node following Bollback 2006)

pi =

a b c

0.3333333 0.3333333 0.3333333

> # data with uncertainty

> btrees<-make.simmap(tree,PP,nsim=200,model="ER")

make.simmap is sampling character histories conditioned on the transition matrix

Q =

a b c

a -1.1736451 0.5868225 0.5868225

b 0.5868225 -1.1736451 0.5868225

c 0.5868225 0.5868225 -1.1736451

(estimated using likelihood);

and state frequencies (used to sample the root node following Bollback 2006)

pi =

a b c

0.3333333 0.3333333 0.3333333

> # compute posterior probabilities at nodes & tips

> AA<-describe.simmap(atrees,plot=TRUE)

200 trees with a mapped discrete character with states:

a, b, c

trees have 14.255 changes between states on average

changes are of the following types:

a,b a,c b,a b,c c,a c,b

x->y 1.37 3.82 1.045 1.67 4.01 2.34

mean total time spent in each state is:

a b c total

raw 2.2430532 0.78932529 5.2821745 8.314553

prop 0.2697744 0.09493298 0.6352927 1.000000

200 trees with a mapped discrete character with states:

a, b, c

trees have 11.755 changes between states on average

changes are of the following types:

a,b a,c b,a b,c c,a c,b

x->y 1.005 3.21 0.8 1.075 4.145 1.52

mean total time spent in each state is:

a b c total

raw 2.1909087 0.73316975 5.390475 8.314553

prop 0.2635029 0.08817909 0.648318 1.000000

One of the neat attributes of this approach is that in addition to estimates at internal nodes, we also get posterior probabilities for tips. These are plotted in the figure above. (The necessary downside, of course, is that sometimes these estimates will be *wrong*, as is the case for tip t20, which we infer with high confidence is *green* when it should actually be *red*.)

That's it.

I have my missing data as NA (for a binary state with C3 vs C4 photosynthesis. What should I do to make the conversion to the uncertain state.....

ReplyDelete> # t20 might be 'a' or 'b'

ReplyDelete> PP["t20",2:3]<-c(0.5,0.5)

It looks like a typo? It should be

PP["t20",1:2]<-c(0.5,0.5) right?

This was very helpful, thanks! How would you go about making a legend (and designating colors)? Seeing that each species is represented by the probabilities of 3 states, and not by a single character, it doesn't seem like the "add.simmap.legend" function works as usual (i.e. by designating a unique color to each character in the dataset).

ReplyDelete