Tuesday, September 24, 2024

Identifiability of the rateshift model in phytools (and some updates to help see cases of non-identifiability)

The phytools function rateshift fits a model with one or more temporal rate shifts that affect all the branches of the tree, identifying the location of these temporal shifts via Maximum Likelihood.

This method is described in my book with Luke Harmon, as well as in a number of prior posts on this blog over the years (e.g., 1, 2, 3).

A colleague recently shared an empirical example that he thought be experiencing issues with convergence (to the ML solution), but that we actually realized was an example of non-identifiability. Non-identifiability means that multiple sets of parameter values (or, equivalently, parameterizations) can lead to exactly the same maximum likelihood. Though non-identifiability had not previously been characterized for this particular model, it doesn’t surprise me. Indeed, it’s pretty easy to see for certain trivial cases how a rate shift model could be non-identifiable for the position of the shift and the rates of evolution, \(\sigma_{1}^2\) and \(\sigma_{2}^2\). For instance, if the shift occurs after the most recent branching time in the tree (that is, along a terminal edge for all tips) than there must be an infinite number of different combinations of rates and shift points that could explain our data with the same probability.

In practice, identifiability will depend on the specific data & tree, as well as the number of shifts in our model – but from a practical standpoint I might predict that non-identifiability is likely to be more common for smaller trees and more rate shifts.

There’s no way to solve this problem – however, I have updated phytools on GitHub so that it can be more easily seen. In particular, I made two significant changes. (1) rateshift runs multiple optimization iterations by default (and the model can be hard to optimize even when fully identifiable, so this is highly recommended). Previously, only the best of these optimization was retained. Now the results from all optimization iterations are kept and returned to the user as part of the model object. (2) rateshift previously assessed the frequency of convergence to the same best fit by measuring the number of optimization iterations with the same (best) likelihood score. This inadvertently assumes identifiability, which we can see is not guaranteed. Now rateshift not only assesses whether or not the same likelihood is obtained, but also whether or not the parameters of the fitted model are equal (to a reasonable machine precision).

To illustrate this, I’m going to use a simulated tree 100 units of depth and 100 taxa, with rate shifts in a quantitative character at 50, 70, and 90 time units above the root. The rate of evolution on these intervals (i.e., from 0 to 50, 50 to 70, etc.) will shift back and forth between 20 and 1.0, such that \(\sigma_{1}^2 = \sigma_{3}^2 = 20\) and \(\sigma_{2}^2 = \sigma_{4}^2 = 1.0\).

Here’s my simulation.

library(phytools)
packageVersion("phytools")
## [1] '2.3.15'
tree<-pbtree(n=100,scale=100)
tt<-make.era.map(tree,setNames(c(0,50,70,90),1:4))
cols<-setNames(palette()[1:4],1:4)
plot(ladderize.simmap(tt),cols,ftype="off")
legend("bottomleft",
  legend=c(expression(paste(sigma^2,"= 20")),
    expression(paste(sigma^2,"= 1")),
    expression(paste(sigma^2,"= 20")), 
    expression(paste(sigma^2,"= 1"))),
  col=cols,bty="n",cex=0.8,lwd=2)

plot of chunk unnamed-chunk-40

sig2<-setNames(c(20,1,20,1),1:4)
x<-sim.rates(tt,sig2)

Now let’s fit our one-rate, two-rate, three-rate, and four-rate models.

fit1<-rateshift(tree,x,parallel=TRUE)
## Opened cluster with 10 cores.
## Running optimization iterations in parallel.
## Please wait....
## Done.

This is a simple one-rate (i.e., no shift) Brownian model, so the model fit & parameters should closely match what we might obtain using, say, geiger::fitContinuous for the same model.

fit1
## ML 1-rate model:
##       s^2(1) se(1) k   logL
## value  6.413 0.907 2 -380.3
## 
## This is a one-rate model.
## 
## Model fit using ML.
## 
## Frequency of best fit: 1 
## 
## Optimization may not have converged.

We did (almost certainly unnecessarily) run 10 optimization iterations. To look at all 10 of them, we merely need to print fit1$all.fits as follows.

print(fit1$all.fits,digits=6)
##    sig2(1)   log(L)
## 1  6.41312 -380.338
## 2  6.41312 -380.338
## 3  6.41312 -380.338
## 4  6.41312 -380.338
## 5  6.41312 -380.338
## 6  6.41312 -380.338
## 7  6.41312 -380.338
## 8  6.41312 -380.338
## 9  6.41312 -380.338
## 10 6.41312 -380.338

Next, let’s fit our two-rate (i.e., one-shift model). This model seems to be typically identifiable (except in some trivial cases).

fit2<-rateshift(tree,x,nrates=2,parallel=TRUE)
## Opened cluster with 10 cores.
## Running optimization iterations in parallel.
## Please wait....
## Done.

Once again, we’ve run 10 optimization iterations, in parallel. We can print the best of these as follows.

fit2
## ML 2-rate model:
##       s^2(1) se(1) s^2(2) se(2) k   logL
## value  15.71 3.008 0.7328 0.164 4 -348.4
## 
## Shift point(s) between regimes (height above root):
##         1|2 se(1|2)
## value 89.35 0.05292
## 
## Model fit using ML.
## 
## Frequency of best fit: 0.4 
## 
## R thinks it has found the ML solution.

We can also look at all 10, just as we did before.

print(fit2$all.fits,digits=6)
##       1<->2  sig2(1)  sig2(2)   log(L)
## 4  89.35212  15.7140 0.732815 -348.419
## 7  89.35212  15.7140 0.732815 -348.419
## 6  89.35211  15.7141 0.732815 -348.419
## 3  89.35216  15.7140 0.732815 -348.419
## 5  90.13193  14.6454 0.706147 -348.492
## 10 52.53078  35.4834 4.573128 -372.138
## 1  52.53169  35.4818 4.573120 -372.138
## 8  52.50331  35.5323 4.573357 -372.138
## 9  52.49408  35.5488 4.573434 -372.138
## 2   2.70186 272.8927 5.999341 -378.500

Next, let’s fit a three-rate (two shift) model by setting nrates=3. This time, we’ll increase our optimization iterations to niter=40.

fit3<-rateshift(tree,x,nrates=3,parallel=TRUE,niter=40)
## Opened cluster with 15 cores.
## Running optimization iterations in parallel.
## Please wait....
## Done.
fit3
## ML 3-rate model:
##       s^2(1) se(1) s^2(2) se(2) s^2(3)  se(3) k   logL
## value  34.28  22.2  12.42 2.683 0.6886 0.1625 6 -347.4
## 
## Shift point(s) between regimes (height above root):
##        1|2 se(1|2)   2|3 se(2|3)
## value 44.4  0.4432 90.54  0.7465
## 
## Model fit using ML.
## 
## Frequency of best fit: 0.05 
## 
## R thinks it has found the ML solution.
print(fit3$all.fits,digits=6)
##       1<->2   2<->3   sig2(1)     sig2(2)  sig2(3)   log(L)
## 22 44.40279 90.5415  34.27730 1.24198e+01 0.688608 -347.404
## 6  44.40270 90.5404  34.27969 1.24214e+01 0.688657 -347.404
## 15 44.40238 90.5417  34.28125 1.24195e+01 0.688639 -347.404
## 8  44.40317 90.5423  34.27565 1.24180e+01 0.688566 -347.404
## 39 44.40369 90.5420  34.27685 1.24192e+01 0.688592 -347.404
## 37 44.40359 90.5427  34.27762 1.24181e+01 0.688562 -347.404
## 7  44.40197 90.5425  34.27598 1.24171e+01 0.688568 -347.404
## 26 44.40189 90.5404  34.30484 1.24190e+01 0.688641 -347.404
## 18 44.40134 90.5412  34.28298 1.24201e+01 0.688603 -347.404
## 10 44.40436 90.5433  34.25980 1.24179e+01 0.688557 -347.404
## 33 44.40451 90.5430  34.30243 1.24185e+01 0.688549 -347.404
## 13 44.40262 90.5371  34.27460 1.24257e+01 0.688784 -347.404
## 17 44.40781 90.5384  34.27022 1.24240e+01 0.688725 -347.404
## 27 44.39678 90.5440  34.28762 1.24153e+01 0.688459 -347.404
## 36 44.39422 90.5415  34.28011 1.24204e+01 0.688590 -347.404
## 16 19.85439 90.5148 127.39826 1.27072e+01 0.688729 -347.503
## 14 19.85444 90.5127 127.40406 1.27092e+01 0.688837 -347.503
## 2  19.85452 90.5144 127.36118 1.27076e+01 0.688752 -347.503
## 30 19.85446 90.5104 127.32434 1.27132e+01 0.688918 -347.503
## 11 19.85413 90.5259 127.58445 1.26925e+01 0.688268 -347.503
## 1  19.85442 90.5004 127.16962 1.27263e+01 0.689362 -347.503
## 19 90.57304 98.3022  14.26742 3.60017e-01 0.955852 -347.573
## 31 44.39361 89.3516  33.51793 1.41072e+01 0.738306 -347.647
## 28  4.65485 90.4419 112.88488 1.37823e+01 0.689071 -348.229
## 9   9.01433 90.4376  64.94502 1.37876e+01 0.689332 -348.229
## 29 85.31761 89.0376  14.63693 1.99498e+01 0.730945 -348.338
## 35 85.31819 89.0701  14.65244 1.96871e+01 0.730727 -348.338
## 4  78.81765 90.4886  16.32794 1.27410e+01 0.688944 -348.381
## 40 46.89670 47.0559  31.13711 2.25436e+03 4.508658 -371.464
## 24 53.94660 64.8511  35.58749 6.41312e-04 4.602282 -371.772
## 25 53.94284 64.8512  35.59533 6.41312e-04 4.602306 -371.772
## 12 53.88228 64.8513  35.72194 6.41312e-04 4.602670 -371.772
## 5  13.49694 51.0415   1.35391 4.30970e+01 4.574627 -372.052
## 38 13.91676 51.0459   2.62926 4.30805e+01 4.574617 -372.052
## 21 13.52555 51.0500   1.47633 4.30705e+01 4.574596 -372.052
## 20 13.89952 51.0514   2.60380 4.30612e+01 4.574554 -372.052
## 34 14.30243 51.0520   3.74658 4.30573e+01 4.574542 -372.052
## 3  33.98171 49.9846  24.28969 5.46582e+01 4.574736 -372.094
## 23 33.98210 49.9785  24.26218 5.47138e+01 4.574739 -372.094
## 32 33.99019 49.9858  24.30057 5.46518e+01 4.574719 -372.094

Finally, let’s fit a four-rate (three shift) model. This is our generating model, recall.

fit4<-rateshift(tree,x,nrates=4,parallel=TRUE,niter=40)
## Opened cluster with 15 cores.
## Running optimization iterations in parallel.
## Please wait....
## Done.
fit4
## ML 4-rate model:
##       s^2(1) se(1) s^2(2) se(2) s^2(3) se(3) s^2(4)  se(4) k   logL
## value  36.47 22.85 0.0006   NaN  17.08 5.634 0.7314 0.1634 8 -346.1
## 
## Shift point(s) between regimes (height above root):
##         1|2 se(1|2)   2|3 se(2|3)   3|4 se(3|4)
## value 50.96   2.908 73.58   7.417 89.35 0.06403
## 
## Model fit using ML.
## 
## Frequency of best fit: 0.025 
## 
## R thinks it has found the ML solution.
print(fit4$all.fits,digits=6)
##       1<->2   2<->3   3<->4       sig2(1)     sig2(2)     sig2(3)  sig2(4)   log(L)
## 35 50.95619 73.5847 89.3514  36.466139730 6.41312e-04 1.70774e+01 0.731419 -346.064
## 21 51.41255 73.5807 90.1285  36.323787084 6.41312e-04 1.54413e+01 0.706156 -346.180
## 6  51.43523 73.5806 90.1293  36.285506735 6.41312e-04 1.54361e+01 0.706058 -346.180
## 19 51.56293 73.5802 90.4154  36.307272388 6.41312e-04 1.48826e+01 0.689743 -346.193
## 13 44.40319 90.7219 98.3022  34.316455009 1.23956e+01 3.59282e-01 0.955749 -346.469
## 25 47.40261 90.6697 98.3020  32.084403249 1.24256e+01 3.62393e-01 0.955010 -346.507
## 20 19.85445 90.6545 98.3022 126.977206373 1.27408e+01 3.61529e-01 0.955089 -346.570
## 34 89.15406 92.9507 98.3021  15.267079475 3.67117e+00 1.92580e-02 1.102262 -346.584
## 16 16.88944 18.6322 90.5389   0.000641312 8.57362e+02 1.24381e+01 0.688655 -346.589
## 37 16.23655 18.0225 90.5387   0.000641312 8.32380e+02 1.24382e+01 0.688663 -346.589
## 17 17.12496 19.5547 90.5398   0.000641312 6.19482e+02 1.24366e+01 0.688784 -346.589
## 2  16.10806 19.3002 90.5380   0.000641312 4.70792e+02 1.24393e+01 0.688700 -346.589
## 12 16.35884 19.3601 90.5379   0.000641312 5.00869e+02 1.24392e+01 0.688682 -346.589
## 15 19.85437 73.5803 89.3500 186.713332381 7.92706e+00 1.65897e+01 0.731594 -347.182
## 24  6.61844 19.6110 90.3780   0.000641312 1.69604e+02 1.27957e+01 0.697330 -347.278
## 5  44.40289 90.6458 92.4383  34.285104952 1.24059e+01 6.41312e-04 0.695366 -347.320
## 26 12.76458 43.6126 90.5465   1.274688295 4.26147e+01 1.23701e+01 0.688575 -347.335
## 36 11.88099 42.8716 90.5494   0.000641312 4.29774e+01 1.23826e+01 0.688360 -347.337
## 7  10.36366 43.5612 90.5087   0.000641312 4.08145e+01 1.24321e+01 0.690062 -347.340
## 8  44.40268 89.1633 90.9771  33.947205564 1.31210e+01 6.39674e+00 0.689274 -347.359
## 22  5.94813 44.4042 90.5957   0.000641312 3.70621e+01 1.23217e+01 0.686894 -347.361
## 18 34.76740 42.7866 90.5450  26.938296985 5.80279e+01 1.23776e+01 0.688632 -347.391
## 3  34.24624 42.9133 90.5461  26.857711269 5.57060e+01 1.23744e+01 0.688597 -347.391
## 39 34.11522 42.9723 90.5476  26.880295158 5.49191e+01 1.23753e+01 0.688534 -347.391
## 14 38.02339 41.3611 90.5393  26.888304528 1.08108e+02 1.23844e+01 0.688863 -347.391
## 29 15.02103 47.4018 90.5482  13.365606330 3.64779e+01 1.23136e+01 0.688708 -347.401
## 27 87.94451 90.9643 98.3023  15.519961399 8.59925e+00 3.59875e-01 0.955312 -347.404
## 30 25.67121 47.4008 90.5471  36.953960376 2.91502e+01 1.23830e+01 0.688494 -347.436
## 11 10.48466 12.3464 90.4392  57.451832963 1.54761e+01 1.37864e+01 0.689281 -348.229
## 28 46.87783 48.4161 64.8510  32.229135384 2.29026e+02 6.41312e-04 4.601006 -371.395
## 9  10.07909 47.2400 47.2673   1.014494319 3.61426e+01 1.20976e+04 4.508585 -371.428
## 1  14.74720 47.3096 47.3977  12.111886231 3.61473e+01 3.72370e+03 4.508548 -371.428
## 10 10.85844 47.1833 47.2245   3.549465434 3.61396e+01 8.07123e+03 4.508307 -371.428
## 38 16.06587 16.6823 55.6433   0.000641312 1.98659e+03 2.13489e+01 4.597495 -371.572
## 32 17.54874 19.1682 55.6434   0.000641312 7.88941e+02 2.13468e+01 4.597305 -371.572
## 33 15.77858 19.4204 55.6434   0.000641312 3.52177e+02 2.13485e+01 4.597281 -371.572
## 31 19.66413 19.8196 57.4729   0.000641312 8.65602e+03 1.89407e+01 4.606644 -371.617
## 23 13.96114 52.2688 64.8511   1.881912259 4.38632e+01 6.41312e-04 4.603764 -371.682
## 40 34.08048 50.9497 64.8512  23.830225162 5.68612e+01 6.41312e-04 4.603462 -371.723
## 4  33.97876 50.8104 64.9235  23.209305663 5.83322e+01 6.41312e-04 4.606284 -371.729

Although we'd be much more confident in our result if we'd converged on the same ML solution more than once, in this simulated example we’re not seeing any conspicuous problems with identifiability – which would be indicated by multiple independent optimization iterations converging on different solutions with the same likelihood. (We also seem to be getting pretty close to the true shift-points, which is neat.) They can occur though, we’ve established, so users of this method should be careful about doubly scrutinizing their findings as a result.

Edit: I repeated the last analysis with 100 optimization instead of 40 and the results holded steady.

That’s all folks!