Saturday, April 12, 2025

A minimum sample size for multirateBM in phytools

A phytools user recently contacted me about multirateBM, the function that implements the variable-rate continuous trait evolution model using penalized likelihood, originally described in Revell (2022).

Their inquiry had multiple parts, but I think it essentially boiled down to: “what is the minimum sample size [i.e., number of tips] for this method?” Specifically, they wanted to know whether the method requires trees with dozens, hundreds, or thousands of operational taxa.

This is actually a harder question than it might seem… or maybe it’s not hard at all. I would say that the answer is “it depends.” Specifically, it depends on how large the effect of your rate heterogeneity is.

To see what I mean, let’s simulate a very modest-sized tree containing just 24 taxa as follows:

library(phytools)
tree<-pbtree(n=24,tip.label=letters[24:1])
plotTree(tree,lwd=1)
nodelabels(cex=0.6)

(Figure: 24-taxon pure-birth tree with node numbers labeled.)

Next, let’s simulate a high rate of evolution, say σ² = 10, for the clade descended from the node labeled 28 in this plot, and a base rate 10× lower of σ² = 1.0 throughout the rest of the tree.

tt<-paintSubTree(tree,28,state="1",anc.state="0")
x<-sim.rates(tt,setNames(c(1,10),0:1))
par(mar=c(5.1,4.1,2.1,2.1))
cols<-setNames(c("blue","red"),0:1)
phenogram(tt,x,las=1,cex.axis=0.8,fsize=0.6,
  colors=cols,lwd=1)
legend("topleft",c(expression(paste(sigma^2,' = 1')),
  expression(paste(sigma^2,' = 10'))),
  col=cols,lty="solid",bty="n")

(Figure: traitgram of the simulated data, with the σ² = 1 regime in blue and the σ² = 10 regime in red.)

Now let’s see if we can detect this rate shift using multirateBM.

I bet we can!

obj<-multirateBM(tree,x)
## Beginning optimization....
## Optimization iteration 1. Using "L-BFGS-B" optimization method.
## Best (penalized) log-likelihood so far: -44.1409 
## Optimization iteration 2. Using "Nelder-Mead" optimization method.
## Best (penalized) log-likelihood so far: -44.1409 
## Optimization iteration 3. Using "BFGS" optimization method.
## Best (penalized) log-likelihood so far: -44.1401 
## Done optimization.
obj
## Multi-rate Brownian model using multirateBM.
## 
## Fitted rates:
##         x        w        v       u        t        s        r        q     
##  1.223448 10.03531 8.792052 7.74407 9.012035 7.655695 7.338579 8.853072 ....
## 
## lambda penalty term: 1
## log-likelihood:  -41.923968 
## AIC:  179.847936 
## 
## R thinks it has found a solution.
plot(obj,ftype="off",lwd=5)

(Figure: fitted rates from multirateBM projected onto the edges of the tree.)

Yes! (And it’s not even close.)

Indeed, even with the default λ\lambda and a single rate shift on a small tree, we’re getting very close to our generating conditions.

Since I just used the default smoothing parameter value, λ = 1, let’s also try λ = 0.1 and λ = 10. (Keep in mind that this is not very rigorous: in practice, I would recommend using some kind of cross-validation procedure to select λ, rather than just picking values arbitrarily.)
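One informal way to see what λ is doing (again, not a substitute for proper cross-validation) is to refit the model over a small grid of λ values and measure how much variation remains among the fitted rates. I’m assuming here that the fitted "multirateBM" object stores its rates in the component $sigsq, as suggested by the printed output above.

```r
## informal sketch: refit over a grid of lambda values
lambda<-c(0.1,1,10)
fits<-lapply(lambda,function(l) multirateBM(tree,x,lambda=l))
## larger lambda penalizes rate variation more strongly, smoothing
## the fitted rates toward a single value -- so the variance of the
## log rates should shrink as lambda increases
sapply(fits,function(f) var(log(f$sigsq)))
```

If this logic is right, the variance of the log rates should decline monotonically from λ = 0.1 to λ = 10, consistent with the plots below.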

fit_lambda0.1<-multirateBM(tree,x,lambda=0.1)
## Beginning optimization....
## Optimization iteration 1. Using "L-BFGS-B" optimization method.
## Best (penalized) log-likelihood so far: -39.2331 
## Optimization iteration 2. Using "Nelder-Mead" optimization method.
## Best (penalized) log-likelihood so far: -39.2331 
## Optimization iteration 3. Using "BFGS" optimization method.
## Best (penalized) log-likelihood so far: -39.2331 
## Done optimization.
fit_lambda0.1
## Multi-rate Brownian model using multirateBM.
## 
## Fitted rates:
##         x        w        v        u        t        s        r        q     
##  0.022254 24.94561 9.068857 13.80581 19.14336 6.017043 4.117849 8.499844 ....
## 
## lambda penalty term: 0.1
## log-likelihood:  -36.919468 
## AIC:  169.838936 
## 
## R thinks it has found a solution.
fit_lambda10<-multirateBM(tree,x,lambda=10)
## Beginning optimization....
## Optimization iteration 1. Using "L-BFGS-B" optimization method.
## Best (penalized) log-likelihood so far: -52.8556 
## Optimization iteration 2. Using "Nelder-Mead" optimization method.
## Best (penalized) log-likelihood so far: -52.8556 
## Optimization iteration 3. Using "BFGS" optimization method.
## Best (penalized) log-likelihood so far: -52.7846 
## Done optimization.
fit_lambda10
## Multi-rate Brownian model using multirateBM.
## 
## Fitted rates:
##         x        w        v        u        t        s        r       q     
##  4.161493 6.482993 6.378398 6.166163 6.311752 6.434874 6.402634 6.66946 ....
## 
## lambda penalty term: 10
## log-likelihood:  -45.330136 
## AIC:  186.660272 
## 
## R thinks it has found a solution.
par(mfrow=c(1,2))
plot(fit_lambda0.1,mar=c(1.1,1.1,3.1,1.1))
mtext(expression(paste("a) ",lambda," = 0.1")),line=1,
  adj=0)
plot(fit_lambda10,mar=c(1.1,1.1,3.1,1.1))
mtext(expression(paste("b) ",lambda," = 10")),line=1,
  adj=0)

(Figure: fitted rates projected onto the tree for a) λ = 0.1 and b) λ = 10.)

The point is that we don’t seem to need a very large tree for multirateBM to successfully identify rate heterogeneity, so long as the effect size (that is, the variation in rate among clades or edges) is large!
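Since we simulated the data ourselves, we can also check rate recovery directly by averaging the fitted rates at the tips inside and outside the fast clade. (This sketch assumes obj$sigsq is named by tip label, as in the printed output above; getDescendants is a phytools function.)

```r
## informal check of rate recovery against the generating conditions
clade<-getDescendants(tree,28)               ## all descendants of node 28
fast<-tree$tip.label[clade[clade<=Ntip(tree)]] ## tips in the fast clade
slow<-setdiff(tree$tip.label,fast)             ## all remaining tips
mean(obj$sigsq[fast])  ## generating rate was sigma^2 = 10
mean(obj$sigsq[slow])  ## generating rate was sigma^2 = 1
```

With the strong rate contrast simulated here, these two averages should fall close to their respective generating values.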
