## Thursday, February 15, 2024

### A little bit about ancestral state estimation under bounded Brownian motion using phytools

Close followers of this blog will already be familiar with my recent obsession with Boucher & DÃ©mery (2016), aka. the most important phylogenetic comparative methods paper you’ve probably never read.

If you haven’t, I’d recommend my earlier posts on this topic here and here.

I’ve now implemented the bounded Brownian motion (BBM) model of Boucher & DÃ©mery (2016) in phytools via the function `bounded_bm`, but what I hadn’t mentioned previously as that I’ve also added an `ancr` ancestral state estimation routine for the model object class.

Code is here. Let’s play around with it a bit.

``````library(phytools)
packageVersion("phytools")
``````
``````## [1] '2.1.13'
``````

First, I’m going to simulate a tree and then some data under BBM using `phytools::fastBM` as follows.

``````tree<-pbtree(n=200,scale=1)
x<-fastBM(tree,sig2=0.4,a=0.5,bounds=c(0,1),internal=TRUE)
phenogram(tree,x,ftype="off",col=make.transparent("blue",0.5),
las=1,cex.axis=0.8)
clip(0,1,-1,2)
abline(h=c(0,1),lty="dotted")
text(0,0.97,"upper bound",pos=4,cex=0.7)
text(0,0.03,"lower bound",pos=4,cex=0.7)
``````

This isn’t as pretty as my illustrative bounded Brownian motion simulations of earlier posts; however, we can see that evolution is both bounded and that the bounds are being reached multiple times.

Before we continue, I’m going to separate out the tip values (which is all we’d know in an empirical study, of course) from the simulated internal states.

``````a<-x[1:tree\$Nnode+Ntip(tree)]
x<-x[tree\$tip.label]
``````

All right. Let’s go ahead and first fit an unbounded (i.e., standard Brownian) model to these data.

``````fit_unbounded<-bounded_bm(tree,x,lik.func="eigen",parallel=TRUE)
fit_unbounded
``````
``````## Object of class "bounded_bm" based on
##  	a discretization with k = 200 levels.
##
## Unwrapped (i.e., bounded) model
##
## Set or estimated bounds: [ -0.974998 , 1.983823 ]
##
## Fitted model parameters:
## 	 sigsq: 0.250278
##      x0: 0.496016
##
## Log-likelihood: 16.589545
##
## R thinks it has found the ML solution.
``````

To proceed and estimate ancestral states I’ll use the `ancr` method as follows.

``````anc_unbounded<-ancr(fit_unbounded)
anc_unbounded
``````
``````## Ancestral character estimates from "bounded_bm" object:
##       201      202      203      204      205      206
##  0.496016 0.505804 0.501098 0.529028 0.547592 0.599433 ....
##
## Lower & upper 95% CIs:
##        lower    upper
## 201 0.245516 0.748515
## 202  0.26031 0.763309
## 203  0.26031 0.748515
## 204 0.289898 0.778104
## 205 0.275104 0.822486
## 206 0.289898  0.91125
##         ....     ....
``````

Now the actual procedure that I’m employing inside the `ancr` method uses the discretized bin midpoints and marginal scaled likelihoods to compute a weighted mean estimate of the trait at each node – but this strategy produces virtually identical estimates to those we would’ve obtained using ML on the original scale of our data, which we can see by comparing these estimates to those from `phytools::fastAnc`.

``````par(mar=c(5.1,4.1,2.1,2.1))
plot(fastAnc(tree,x),anc_unbounded\$ace,bty="n",xlim=c(0,1),ylim=c(0,1),
pch=21,bg=make.transparent("grey",0.5),cex=1.2,xlab="MLEs from fastAnc",
ylab="estimates from discretized space using ancr")
grid()
``````

I guess the only point here is that we’re getting the same values! So far so good.

The more important question, though, is how close our estimates are to the true generating states.

There, we’re not doing as great. To see that, I’m going to graph not only the estimates, but their 95% confidence intervals. I’ll then use grey & red dots, respectively, to highlight true values falling within (grey) our outside (red) our 95% CI around each estimate. Note that on average we would expect our CI to include the true value 19 times out of 20.

``````par(mar=c(5.1,4.1,2.1,2.1))
plot(NA,bty="n",xlim=c(0,1),
ylim=range(anc_unbounded\$CI95),xlab="MLE estimates & CIs",
ylab="true values")
for(i in 1:nrow(anc_unbounded\$CI95)) lines(rep(anc_unbounded\$ace[i],2),
anc_unbounded\$CI95[i,],lwd=2,col=make.transparent("blue",0.25))
outCI<-union(which(a<anc_unbounded\$CI95[,1]),which(a>anc_unbounded\$CI95[,2]))
cols<-rep("grey",length(a))
cols[outCI]<-"red"
points(anc_unbounded\$ace,a,pch=20,col=cols)
grid()
legend("topleft",c("95% CI","true value on CI","true value outside CI"),
lty=c("solid",NA,NA),pch=c(NA,20,20),
col=c(make.transparent("blue",0.25),"grey","red"),
cex=0.6,bty="n",pt.cex=1,lwd=2)
``````

Typically, here we’ll see a substantively greater fraction than 5% of estimates falling outside the CI.

``````length(outCI)/tree\$Nnode
``````
``````## [1] 0.1256281
``````

OK, now let’s do the same using bounded Brownian motion. I’ll do that using the same function as before but in which I update `lims = c(0,1)`, the true values of the limits of our trait.

``````fit_bounded<-bounded_bm(tree,x,lims=c(0,1),lik.func="eigen",parallel=TRUE)
fit_bounded
``````
``````## Object of class "bounded_bm" based on
##  	a discretization with k = 200 levels.
##
## Unwrapped (i.e., bounded) model
##
## Set or estimated bounds: [ 0 , 1 ]
##
## Fitted model parameters:
## 	 sigsq: 0.442651
##      x0: 0.545151
##
## Log-likelihood: 55.680628
##
## R thinks it has found the ML solution.
``````
``````anc_bounded<-ancr(fit_bounded)
anc_bounded
``````
``````## Ancestral character estimates from "bounded_bm" object:
##       201      202     203      204      205      206
##  0.545151 0.567683 0.57462 0.638257 0.664605 0.714024 ....
##
## Lower & upper 95% CIs:
##      lower  upper
## 201 0.0575 0.9675
## 202 0.0675 0.9725
## 203 0.0775 0.9725
## 204 0.1225 0.9825
## 205 0.1275 0.9875
## 206 0.1925 0.9875
##       ....   ....
``````

We’re now ready to make the same comparison as before.

``````par(mar=c(5.1,4.1,2.1,2.1))
plot(NA,bty="n",xlim=c(0,1),
ylim=range(anc_bounded\$CI95),xlab="MLE estimates & CIs",
ylab="true values")
for(i in 1:nrow(anc_bounded\$CI95)) lines(rep(anc_bounded\$ace[i],2),
anc_bounded\$CI95[i,],lwd=2,col=make.transparent("blue",0.25))
outCI<-union(which(a<anc_bounded\$CI95[,1]),which(a>anc_bounded\$CI95[,2]))
cols<-rep("grey",length(a))
cols[outCI]<-"red"
points(anc_bounded\$ace,a,pch=20,col=cols)
grid()
legend("topleft",c("95% CI","true value on CI","true value outside CI"),
lty=c("solid",NA,NA),pch=c(NA,20,20),
col=c(make.transparent("blue",0.25),"grey","red"),
cex=0.6,bty="n",pt.cex=1,lwd=2)
``````

We should see that our CIs are broader, but they also capture much closer to 95% of the true, generating values of the trait at nodes. If we repeat this simulation many times, we’ll find that it’s true in general.

``````length(outCI)/tree\$Nnode
``````
``````## [1] 0.04522613
``````

That’s all I’ve got, folks.