Thursday, February 15, 2024

A little bit about ancestral state estimation under bounded Brownian motion using phytools

Close followers of this blog will already be familiar with my recent obsession with Boucher & Démery (2016), aka. the most important phylogenetic comparative methods paper you’ve probably never read.

If you haven’t, I’d recommend my earlier posts on this topic here and here.

I’ve now implemented the bounded Brownian motion (BBM) model of Boucher & Démery (2016) in phytools via the function bounded_bm, but what I hadn’t mentioned previously as that I’ve also added an ancr ancestral state estimation routine for the model object class.

Code is here. Let’s play around with it a bit.

library(phytools)
packageVersion("phytools")
## [1] '2.1.13'

First, I’m going to simulate a tree and then some data under BBM using phytools::fastBM as follows.

tree<-pbtree(n=200,scale=1)
x<-fastBM(tree,sig2=0.4,a=0.5,bounds=c(0,1),internal=TRUE)
phenogram(tree,x,ftype="off",col=make.transparent("blue",0.5),
  las=1,cex.axis=0.8)
clip(0,1,-1,2)
abline(h=c(0,1),lty="dotted")
text(0,0.97,"upper bound",pos=4,cex=0.7)
text(0,0.03,"lower bound",pos=4,cex=0.7)

plot of chunk unnamed-chunk-4

This isn’t as pretty as my illustrative bounded Brownian motion simulations of earlier posts; however, we can see that evolution is both bounded and that the bounds are being reached multiple times.

Before we continue, I’m going to separate out the tip values (which is all we’d know in an empirical study, of course) from the simulated internal states.

a<-x[1:tree$Nnode+Ntip(tree)]
x<-x[tree$tip.label]

All right. Let’s go ahead and first fit an unbounded (i.e., standard Brownian) model to these data.

fit_unbounded<-bounded_bm(tree,x,lik.func="eigen",parallel=TRUE)
fit_unbounded
## Object of class "bounded_bm" based on
##  	a discretization with k = 200 levels.
## 
## Unwrapped (i.e., bounded) model
## 
## Set or estimated bounds: [ -0.974998 , 1.983823 ]
## 
## Fitted model parameters:
## 	 sigsq: 0.250278 
##      x0: 0.496016 
## 
## Log-likelihood: 16.589545 
## 
## R thinks it has found the ML solution.

To proceed and estimate ancestral states I’ll use the ancr method as follows.

anc_unbounded<-ancr(fit_unbounded)
anc_unbounded
## Ancestral character estimates from "bounded_bm" object:
##       201      202      203      204      205      206     
##  0.496016 0.505804 0.501098 0.529028 0.547592 0.599433 ....
## 
## Lower & upper 95% CIs:
##        lower    upper
## 201 0.245516 0.748515
## 202  0.26031 0.763309
## 203  0.26031 0.748515
## 204 0.289898 0.778104
## 205 0.275104 0.822486
## 206 0.289898  0.91125
##         ....     ....

Now the actual procedure that I’m employing inside the ancr method uses the discretized bin midpoints and marginal scaled likelihoods to compute a weighted mean estimate of the trait at each node – but this strategy produces virtually identical estimates to those we would’ve obtained using ML on the original scale of our data, which we can see by comparing these estimates to those from phytools::fastAnc.

par(mar=c(5.1,4.1,2.1,2.1))
plot(fastAnc(tree,x),anc_unbounded$ace,bty="n",xlim=c(0,1),ylim=c(0,1),
  pch=21,bg=make.transparent("grey",0.5),cex=1.2,xlab="MLEs from fastAnc",
  ylab="estimates from discretized space using ancr")
grid()

plot of chunk unnamed-chunk-8

I guess the only point here is that we’re getting the same values! So far so good.

The more important question, though, is how close our estimates are to the true generating states.

There, we’re not doing as great. To see that, I’m going to graph not only the estimates, but their 95% confidence intervals. I’ll then use grey & red dots, respectively, to highlight true values falling within (grey) our outside (red) our 95% CI around each estimate. Note that on average we would expect our CI to include the true value 19 times out of 20.

par(mar=c(5.1,4.1,2.1,2.1))
plot(NA,bty="n",xlim=c(0,1),
  ylim=range(anc_unbounded$CI95),xlab="MLE estimates & CIs",
  ylab="true values")
for(i in 1:nrow(anc_unbounded$CI95)) lines(rep(anc_unbounded$ace[i],2),
  anc_unbounded$CI95[i,],lwd=2,col=make.transparent("blue",0.25))
outCI<-union(which(a<anc_unbounded$CI95[,1]),which(a>anc_unbounded$CI95[,2]))
cols<-rep("grey",length(a))
cols[outCI]<-"red"
points(anc_unbounded$ace,a,pch=20,col=cols)
grid()
legend("topleft",c("95% CI","true value on CI","true value outside CI"),
  lty=c("solid",NA,NA),pch=c(NA,20,20),
  col=c(make.transparent("blue",0.25),"grey","red"),
  cex=0.6,bty="n",pt.cex=1,lwd=2)

plot of chunk unnamed-chunk-9

Typically, here we’ll see a substantively greater fraction than 5% of estimates falling outside the CI.

length(outCI)/tree$Nnode
## [1] 0.1256281

OK, now let’s do the same using bounded Brownian motion. I’ll do that using the same function as before but in which I update lims = c(0,1), the true values of the limits of our trait.

fit_bounded<-bounded_bm(tree,x,lims=c(0,1),lik.func="eigen",parallel=TRUE)
fit_bounded
## Object of class "bounded_bm" based on
##  	a discretization with k = 200 levels.
## 
## Unwrapped (i.e., bounded) model
## 
## Set or estimated bounds: [ 0 , 1 ]
## 
## Fitted model parameters:
## 	 sigsq: 0.442651 
##      x0: 0.545151 
## 
## Log-likelihood: 55.680628 
## 
## R thinks it has found the ML solution.
anc_bounded<-ancr(fit_bounded)
anc_bounded
## Ancestral character estimates from "bounded_bm" object:
##       201      202     203      204      205      206     
##  0.545151 0.567683 0.57462 0.638257 0.664605 0.714024 ....
## 
## Lower & upper 95% CIs:
##      lower  upper
## 201 0.0575 0.9675
## 202 0.0675 0.9725
## 203 0.0775 0.9725
## 204 0.1225 0.9825
## 205 0.1275 0.9875
## 206 0.1925 0.9875
##       ....   ....

We’re now ready to make the same comparison as before.

par(mar=c(5.1,4.1,2.1,2.1))
plot(NA,bty="n",xlim=c(0,1),
  ylim=range(anc_bounded$CI95),xlab="MLE estimates & CIs",
  ylab="true values")
for(i in 1:nrow(anc_bounded$CI95)) lines(rep(anc_bounded$ace[i],2),
  anc_bounded$CI95[i,],lwd=2,col=make.transparent("blue",0.25))
outCI<-union(which(a<anc_bounded$CI95[,1]),which(a>anc_bounded$CI95[,2]))
cols<-rep("grey",length(a))
cols[outCI]<-"red"
points(anc_bounded$ace,a,pch=20,col=cols)
grid()
legend("topleft",c("95% CI","true value on CI","true value outside CI"),
  lty=c("solid",NA,NA),pch=c(NA,20,20),
  col=c(make.transparent("blue",0.25),"grey","red"),
  cex=0.6,bty="n",pt.cex=1,lwd=2)

plot of chunk unnamed-chunk-13

We should see that our CIs are broader, but they also capture much closer to 95% of the true, generating values of the trait at nodes. If we repeat this simulation many times, we’ll find that it’s true in general.

length(outCI)/tree$Nnode
## [1] 0.04522613

That’s all I’ve got, folks.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.