Phylogenetic Tools for Comparative Biology: Why parameter estimates in fitPagel can hit upper bounds, even if they don't in fitMk for each trait modeled separately....

In response to a user issue, I just wanted to quickly & simply demonstrate an interesting case in which parameter estimates under a Pagel ’94 (Pagel 1994) correlated binary character evolution model can sometimes hit the bounds even when the rates of evolution for each individual character are nowhere near the upper boundary value for optimization.

To do this, I’m going to simulated two discrete traits, x and y, where one is the perfect mirror image of the other. (They don’t need to be perfectly correlated for this to be an issue, but our two traits must be correlated.)

library(phytools)
## generating conditions
N<-200
t<-1.0
q<-0.1
## simulate a tree
tree<-pbtree(n=N,scale=t)
## our Q matrix to simulate x
Q<-q*matrix(c(-1,1,1,-1),2,2,
  dimnames=list(0:1,0:1))
## simulate x
x<-sim.Mk(tree,Q)
## create y as its mirror image
y<-setNames(as.factor(c(1:0)[as.numeric(x)]),
  names(x))

To see that we’ve accomplished what we set out to do we can create a simple visualization of our two traits as follows.

par(mfrow=c(1,2))
## plot tree 1
plotTree(tree,ftype="off",lwd=1)
## get last plotted "phylo" coordinates
pp<-get("last_plot.phylo",envir=.PlotPhyloEnv)
## set colors
cols<-hcl.colors(n=2)[as.numeric(x[tree$tip.label])]
## plot polygons at the tips
for(i in 1:N){
  polygon(pp$xx[i]+t*c(0,0.05,0.05,0),
    pp$yy[i]+c(-0.5,-0.5,0.5,0.5),
    col=cols[i],border="transparent")
}
## repeat for tree 2, but plotted backwards
plotTree(tree,ftype="off",lwd=1,xlim=pp$x.lim[2:1])
pp<-get("last_plot.phylo",envir=.PlotPhyloEnv)
cols<-hcl.colors(n=2)[as.numeric(y[tree$tip.label])]
for(i in 1:N){
  polygon(pp$xx[i]+t*c(0,0.05,0.05,0),
    pp$yy[i]+c(-0.5,-0.5,0.5,0.5),
    col=cols[i],border="transparent")
}

plot of chunk unnamed-chunk-19

(This code words if our tree is already in “cladewise” order and if our trait vectors x and y are factors. Otherwise it should be adapted with caution!)

Now let’s fit standard Mk models to x and y using fitMk as follows.

fit_x<-fitMk(tree,x,pi="fitzjohn")
fit_x

## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##           0         1
## 0 -0.080597  0.080597
## 1  0.080597 -0.080597
## 
## Fitted (or set) value of pi:
##        0        1 
## 0.901183 0.098817 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -23.150054 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

fit_y<-fitMk(tree,y,pi="fitzjohn")
fit_y

## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##           0         1
## 0 -0.080597  0.080597
## 1  0.080597 -0.080597
## 
## Fitted (or set) value of pi:
##        0        1 
## 0.098817 0.901183 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -23.150054 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

Nothing seems remiss so far. Let’s try to fit our Pagel ’94 model using phytools::fitPagel.

fit_xy<-fitPagel(tree,x,y,pi="fitzjohn")
fit_xy

## 
## Pagel's binary character correlation test:
## 
## Assumes "ARD" substitution model for both characters
## 
## Independent model rate matrix:
##           0|0       0|1       1|0       1|1
## 0|0 -0.151827  0.045559  0.106268  0.000000
## 0|1  0.106268 -0.212536  0.000000  0.106268
## 1|0  0.045558  0.000000 -0.091117  0.045559
## 1|1  0.000000  0.045558  0.106268 -0.151826
## 
## Dependent (x & y) model rate matrix:
##             0|0        0|1        1|0       1|1
## 0|0 -200.000000 100.000000 100.000000   0.00000
## 0|1    0.134109  -0.134109   0.000000   0.00000
## 1|0    0.090466   0.000000  -0.090466   0.00000
## 1|1    0.000000   0.000000  80.227651 -80.22765
## 
## Model fit:
##             log-likelihood      AIC
## independent      -45.66622 99.33243
## dependent        -20.15903 56.31807
## 
## Hypothesis test result:
##   likelihood-ratio:  51.0144 
##   p-value:  2.21685e-10 
## 
## Model fitting method used was fitMk 
## 
## R thinks both model optimizations converged.

Here R think that optimization has converged, but the perfectly round values for the transitions 0|0 \(\rightarrow\) 0|1 and 0|0 \(\rightarrow\) 1|0 should give us pause. Indeed, the default value of max.q (the maximum value an element of Q can assume in optimization) is \(100 \times\) the total tree height – which in our case we set to 1.0!

Let’s try again, but adjusting max.q.

fit_xy<-fitPagel(tree,x,y,pi="fitzjohn",max.q=1e4,
  logscale=TRUE)
fit_xy

## 
## Pagel's binary character correlation test:
## 
## Assumes "ARD" substitution model for both characters
## 
## Independent model rate matrix:
##           0|0       0|1       1|0       1|1
## 0|0 -0.151826  0.045559  0.106268  0.000000
## 0|1  0.106268 -0.212536  0.000000  0.106268
## 1|0  0.045559  0.000000 -0.091117  0.045559
## 1|1  0.000000  0.045559  0.106268 -0.151826
## 
## Dependent (x & y) model rate matrix:
##               0|0          0|1         1|0           1|1
## 0|0 -19786.264614 10000.000000 9786.264614      0.000000
## 0|1      0.070508    -0.141017    0.000000      0.070508
## 1|0      0.046192     0.000000   -0.092383      0.046192
## 1|1      0.000000 10000.000000 9786.840407 -19786.840407
## 
## Model fit:
##             log-likelihood      AIC
## independent      -45.66622 99.33243
## dependent        -20.00553 56.01107
## 
## Hypothesis test result:
##   likelihood-ratio:  51.3214 
##   p-value:  1.9124e-10 
## 
## Model fitting method used was fitMk 
## 
## R thinks both model optimizations converged.

We’re still hitting the upper bound – we could set max.q higher, but our results would stay (qualitatively) the same.

What’s going on here?

Well, in effect, we’ve manufactured data in which the character levels 0|0 and 1|1 are so unstable (indeed, they’re never observed among our tip taxa) that any high transition rate values away from these states into the stable combinations of 1|0 or 0|1 increase the probability of observing the data that we have.

Obviously, this example is artificial – but the same problem is likely to afflict real datasets for which x and y are not necessarily the same trait, but whose evolutionary history is highly correlated.

A take home message for me (as a developer) is that I need to make fitPagel more effective at diagnosing & reporting cases in which an optimization bound has been reached! Look for that in a future phytools update.

Phylogenetic Tools for Comparative Biology

Pages

Wednesday, June 4, 2025

Why parameter estimates in `fitPagel` can hit upper bounds, even if they don't in `fitMk` for each trait modeled separately....

No comments:

Post a Comment