Earlier (1, 2) I blogged about reconstructing tip states when they are unknown.

What one needs to do to show that this method is working as designed, for
discrete traits, at least, is not to merely measure whether the most probable
state is correct - but to compare the posterior probabilities to the
frequency of times said state is in fact correct. In other words, we expect
a tip state that is estimated to have a posterior probability of being in
state *a* of 0.6, for instance, to in fact *be* in state *a*
60% of the time. Another way of saying this is that if we take all tip states
with *PP(a) = x*, *x*×100% of the time these tips should have
state *a*. This was pointed out to me as a good way to measure the
performance of a Bayesian estimation method by
Brian O'Meara a few yeara ago, and,
of course, he was correct.

Since posterior probabilities can take any value on [0,1], in practice this means we need to bin our posterior probabilities.

Here, that's exactly what I have done. First, I simulated trees & discrete
character data 200 times; next I arbitrarily set 40% (40/100 taxa) of tip
values to be unknown; then I sampled their true values using stochastic
mapping; and, finally, I binned the posterior probabilities and asked what
with what frequency, say, tips with posterior probabilities between
*PP(a)=0.05* & *PP(a)=0.1* were in fact in state *a*, and
so on. Here I elected to fix the transition matrix, * Q*, to its
true value for stochastic mapping - essentially to avoid confounding error in
the estimation of tip states given

*with estimation of*

**Q***itself.*

**Q**Here is what I did & found piece by piece.

Load packages & simulate trees. Set Q:

```
library(phytools)
## simulate trees & set Q
trees<-pbtree(n=200,scale=1,nsim=200)
Q<-matrix(c(-1,1,1,-1),2,2)
rownames(Q)<-colnames(Q)<-letters[1:2]
Q
```

```
## a b
## a -1 1
## b 1 -1
```

Create vectors to store our results. I have to count all instances in which
the posterior probabilities were between *x* & *y*
(`denoms`

), as well as the frequencies for each of those
intervals in which each state was correct.

```
nums<-denoms<-
setNames(rep(0,20),paste(seq(0,0.95,0.05),"<->",
seq(0.05,1,0.05),sep=""))
```

Now let's run our simulations:

```
for(i in 1:length(trees)){
missing<-sample(trees[[i]]$tip.label,40)
x<-sim.history(trees[[i]],Q,message=FALSE)$states
X<-to.matrix(x,letters[1:2])
X[missing,]<-rep(0.5,2)
m<-make.simmap(trees[[i]],X,nsim=100,Q=Q,message=FALSE)
tips<-summary(m)$tips[,letters[1:2]]
n<-round(20*tips[missing,1]*to.matrix(x[missing],
letters[1:2])[,1])
d<-round(20*tips[missing,1])
for(j in 1:length(n)){
nums[n[j]]<-nums[n[j]]+1
denoms[d[j]]<-denoms[d[j]]+1
}
if(i%%20==0&&i!=0) cat(".\n") else cat(".")
}
```

```
## ....................
## ....................
## ....................
## ....................
## ....................
## ....................
## ....................
## ....................
## ....................
## ....................
```

```
cat("Done simulations.\n")
```

```
## Done simulations.
```

Finally, we have our results!

```
print(nums/denoms)
```

```
## 0<->0.05 0.05<->0.1 0.1<->0.15 0.15<->0.2 0.2<->0.25 0.25<->0.3
## 0.05668449 0.07736390 0.15746421 0.19537275 0.22800000 0.27450980
## 0.3<->0.35 0.35<->0.4 0.4<->0.45 0.45<->0.5 0.5<->0.55 0.55<->0.6
## 0.31952663 0.37956204 0.43801653 0.48739496 0.42500000 0.58955224
## 0.6<->0.65 0.65<->0.7 0.7<->0.75 0.75<->0.8 0.8<->0.85 0.85<->0.9
## 0.69677419 0.63068182 0.71984436 0.78932584 0.86315789 0.89079563
## 0.9<->0.95 0.95<->1
## 0.94976077 0.98012232
```

```
plot(seq(0.025,0.975,0.05),nums/denoms,xlim=c(0,1),ylim=c(0,1),
xlab="Posterior probability (state 'a')",
ylab="Observed frequency in state 'a'",pch=21,bg="grey",
cex=1.2)
lines(c(0,1),c(0,1),lty="dashed",lwd=2,
col=make.transparent("blue",0.4))
```

If our results are on a 1:1 line, as I hope will be the case, then most likely the method is working as designed.

## No comments:

## Post a Comment