Monday, October 24, 2016

On the accuracy of reconstructed tip states using make.simmap

Earlier (1, 2) I blogged about reconstructing tip states when they are unknown.

What one needs to do to show that this method is working as designed, for discrete traits, at least, is not to merely measure whether the most probable state is correct - but to compare the posterior probabilities to the frequency of times said state is in fact correct. In other words, we expect a tip state that is estimated to have a posterior probability of being in state a of 0.6, for instance, to in fact be in state a 60% of the time. Another way of saying this is that if we take all tip states with PP(a) = x, x×100% of the time these tips should have state a. This was pointed out to me as a good way to measure the performance of a Bayesian estimation method by Brian O'Meara a few yeara ago, and, of course, he was correct.

Since posterior probabilities can take any value on [0,1], in practice this means we need to bin our posterior probabilities.

Here, that's exactly what I have done. First, I simulated trees & discrete character data 200 times; next I arbitrarily set 40% (40/100 taxa) of tip values to be unknown; then I sampled their true values using stochastic mapping; and, finally, I binned the posterior probabilities and asked what with what frequency, say, tips with posterior probabilities between PP(a)=0.05 & PP(a)=0.1 were in fact in state a, and so on. Here I elected to fix the transition matrix, Q, to its true value for stochastic mapping - essentially to avoid confounding error in the estimation of tip states given Q with estimation of Q itself.

Here is what I did & found piece by piece.

Load packages & simulate trees. Set Q:

library(phytools)
## simulate trees & set Q
trees<-pbtree(n=200,scale=1,nsim=200)
Q<-matrix(c(-1,1,1,-1),2,2)
rownames(Q)<-colnames(Q)<-letters[1:2]
Q
##    a  b
## a -1  1
## b  1 -1

Create vectors to store our results. I have to count all instances in which the posterior probabilities were between x & y (denoms), as well as the frequencies for each of those intervals in which each state was correct.

nums<-denoms<-
    setNames(rep(0,20),paste(seq(0,0.95,0.05),"<->",
    seq(0.05,1,0.05),sep=""))

Now let's run our simulations:

for(i in 1:length(trees)){
    missing<-sample(trees[[i]]$tip.label,40)
    x<-sim.history(trees[[i]],Q,message=FALSE)$states
    X<-to.matrix(x,letters[1:2])
    X[missing,]<-rep(0.5,2)
    m<-make.simmap(trees[[i]],X,nsim=100,Q=Q,message=FALSE)
    tips<-summary(m)$tips[,letters[1:2]]
    n<-round(20*tips[missing,1]*to.matrix(x[missing],
        letters[1:2])[,1])
    d<-round(20*tips[missing,1])
    for(j in 1:length(n)){
        nums[n[j]]<-nums[n[j]]+1
        denoms[d[j]]<-denoms[d[j]]+1
    }
    if(i%%20==0&&i!=0) cat(".\n") else cat(".")
}
## ....................
## ....................
## ....................
## ....................
## ....................
## ....................
## ....................
## ....................
## ....................
## ....................
cat("Done simulations.\n")
## Done simulations.

Finally, we have our results!

print(nums/denoms)
##   0<->0.05 0.05<->0.1 0.1<->0.15 0.15<->0.2 0.2<->0.25 0.25<->0.3 
## 0.05668449 0.07736390 0.15746421 0.19537275 0.22800000 0.27450980 
## 0.3<->0.35 0.35<->0.4 0.4<->0.45 0.45<->0.5 0.5<->0.55 0.55<->0.6 
## 0.31952663 0.37956204 0.43801653 0.48739496 0.42500000 0.58955224 
## 0.6<->0.65 0.65<->0.7 0.7<->0.75 0.75<->0.8 0.8<->0.85 0.85<->0.9 
## 0.69677419 0.63068182 0.71984436 0.78932584 0.86315789 0.89079563 
## 0.9<->0.95   0.95<->1 
## 0.94976077 0.98012232
plot(seq(0.025,0.975,0.05),nums/denoms,xlim=c(0,1),ylim=c(0,1),
    xlab="Posterior probability (state 'a')",
    ylab="Observed frequency in state 'a'",pch=21,bg="grey",
    cex=1.2)
lines(c(0,1),c(0,1),lty="dashed",lwd=2,
    col=make.transparent("blue",0.4))

plot of chunk unnamed-chunk-1

If our results are on a 1:1 line, as I hope will be the case, then most likely the method is working as designed.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.