Thursday, March 21, 2013

A little more on ancestral state estimation

What is true of marginal ancestral state estimates is that (for a reversible model of evolution) they are equivalent to the conditional scaled likelihoods at the root node of the tree. This is stated in Felsenstein (2004), p. 259. Thus, we should be able to move the root to each node and recalculate the scaled conditional likelihoods for that node.

A simple function to do this in R is as follows:

rerootingMethod<-function(tree,x,model=c("ER","SYM")){
  model<-model[1]
  n<-length(tree$tip.label)
  XX<-matrix(NA,tree$Nnode,length(unique(x)))
  rownames(XX)<-1:tree$Nnode+n
  colnames(XX)<-sort(unique(x))
  YY<-ace(x,tree,type="discrete",model=model)
  XX[1,]<-YY$lik.anc[1,]
  Q<-matrix(YY$rates[YY$index.matrix],ncol(XX),ncol(XX),
   dimnames=list(colnames(XX),colnames(XX)))
  diag(Q)<--colSums(Q,na.rm=TRUE)
  for(i in 2:tree$Nnode){
    tt<-reroot(tree,node.number=i+n,position=
     tree$edge.length[which(tree$edge[,2]==(i+n))])
    XX[i,]<-ace(x,tt,type="discrete",model=
     model)$lik.anc[1,]
  }
  return(list(loglik=YY$loglik,Q=Q,marginal.anc=XX))
}

We can thus easily compare the marginal ancestral state reconstructions to the marginal frequencies from our stochastic mapping. As I noted earlier I think that these will probably be very similar - but they may not be exactly the same. This is because on the one hand we have marginal ancestral state estimates, and on the other we have marginal frequencies from a joint sampling procedure (also noted here).

First, let's do stochastic mapping:

> # simulation & analysis
> tree<-pbtree(n=50,scale=1)
> Q<-matrix(c(-2,1,1,1,-2,1,1,1,-2),3,3)
> rownames(Q)<-colnames(Q)<-letters[1:3]
> x<-sim.history(tree,Q)$states
> # stochastic mapping
> mtrees<-make.simmap(tree,x,nsim=500,model="ER")
make.simmap is sampling character histories conditioned on the transition matrix
Q =
          a         b         c
a -2.200009  1.100005  1.100005
b  1.100005 -2.200009  1.100005
c  1.100005  1.100005 -2.200009
(estimated using likelihood);
and state frequencies (used to sample the root node following Bollback 2006)
pi =
        a         b         c
0.3333333 0.3333333 0.3333333
> # function to compute the states
> getStates<-function(x){
  y<-setNames(sapply(x$maps,function(x) names(x)[1]),
   x$edge[,1])
  y<-y[as.character(length(x$tip)+1:x$Nnode)]
  return(y)
 }
> AA<-sapply(mtrees,getStates)
> piesAA<-t(apply(AA,1,function(x,levels,Nsim) summary(factor(x,levels))/Nsim,levels=letters[1:3], Nsim=length(mtrees)))
> plot(tree,no.margin=TRUE,show.tip.label=FALSE, edge.width=2)
> nodelabels(pie=piesAA,piecol=setNames(c("blue","red", "green"),colnames(piesAA)),cex=0.6)
> tips<-sapply(letters[1:3],"==",x)*1
> tiplabels(pie=tips,piecol=setNames(c("blue","red", "green"),colnames(piesAA)),cex=0.6)

Now let's get the true marginal reconstructions using the re-rooting method:

> source("rerootingMethod.R")
> BB<-rerootingMethod(tree,x,model="ER")
> piesBB<-BB$marginal.anc
> plot(tree,no.margin=TRUE,show.tip.label=FALSE, edge.width=2)
> nodelabels(pie=piesBB,piecol=setNames(c("blue","red", "green"),colnames(piesBB)),cex=0.6)
> tiplabels(pie=tips,piecol=setNames(c("blue","red", "green"),colnames(piesBB)),cex=0.6)

These seem quite similar, and in fact they are:

> plot(piesAA,piesBB,xlab="stochastic mapping",
ylab="re-rooting method")
> lines(c(0,1),c(0,1),lty="dashed")
Obviously, part of the scatter around the 1:1 line here is due to the fact that stochastic mapping is, well, stochastic - so this would contract further if more simulations were used. However, some discrepancy might remain due to the fact the x is the marginal frequencies from joint sampling; whereas y are the marginal probabilities, as noted above. Discussion of the difference between marginal & joint reconstructions can be found in Felsenstein (2004) & Yang (2006).

The re-rooting method above is very slow (of course, nowadays this just means a few seconds on a tree of 50 taxa on my i5 desktop computer, and 90s for a tree with 200 species). Much faster algorithms exist, and I believe that this is what is implemented in diversitree.

That's it on this topic!

2 comments:

  1. Thanks for the posts on this, very very useful!

    A quick question to follow up on your two posts, and some recent discussion on the R list: do you have a good feeling for the quickest implementation to do this on large trees?

    (Happy to assume for the purposes of this question that I'm O.K. with the stochastic variation in stochastic character mapping, and that I'm not worried about the differences between marginal frequencies and marginal probabilities...)

    ReplyDelete
    Replies
    1. Hi Rob.

      I think diversitree is the fastest. More on this, here.

      All the best, Liam

      Delete