Thursday, December 7, 2023

A variable rate discrete trait evolution model with Γ-distributed rates among edges

I just added a new function to phytools, fitgammaMk, that fits an extended Mk discrete character evolution model, but with \(\Gamma\)-distributed rates among edges of the tree.

It does this in the same way as we might for variable rates across sites in a nucleotide sequence alignment during phylogeny inference, using a discretized \(\Gamma\) in which the number of rate categories is set by the user.

Since this wasn’t all that difficult to do, I have to imagine that something similar already exists elsewhere – possibly in RevBayes or other software.

So far, I haven’t yet added any interesting features to this function, such as ancestral state estimation under the model or assignment of rates to edges.

To start, all I’d like to show here is that we can use it to (firstly) measure \(\Gamma\)-distributed rate heterogeneity if it it exists in our data, and (secondly) more accurately estimate the underlying transition process, given by the Q matrix, when we take rate heterogeneity among edges into account.

To follow along, readers will need to update phytools from GitHub, which can be done using the remotes package (e.g., remotes::install_github("liamrevell/phytools")).

library(phytools)
## Loading required package: ape
## Loading required package: maps
packageVersion("phytools")
## [1] '2.0.8'

To show that we can estimate rate heterogeneity where it exists, I’ll use a numerical simulation. I’ll start by simulating a pure-birth tree of 1,000 taxa. Then I’ll simulate \(\Gamma\)-distributed rates among sites with a shape parameter, \(\alpha\) of \(\alpha\) = 0.3. Finally, I’ll simulate a trait evolving under this scenario of rate variation among edges. Note that \(\alpha\) = 0.3 corresponds to quite high heterogeneity in the rate of evolution across the branches of our tree!

## simulate tree
tree<-pbtree(n=1000)
tree
## 
## Phylogenetic tree with 1000 tips and 999 internal nodes.
## 
## Tip labels:
##   t358, t660, t787, t788, t659, t210, ...
## 
## Rooted; includes branch lengths.
## create transition matrix for simulation
Q<-matrix(c(-1.5,1,0.5,
  1,-2,1,
  0.5,1,-1.5),3,3,byrow=TRUE,
  dimnames=list(0:2,0:2))
Q
##      0  1    2
## 0 -1.5  1  0.5
## 1  1.0 -2  1.0
## 2  0.5  1 -1.5
## simulate the rates under Gamma rate heterogeneity
alpha<-0.3
rates<-rgamma(n=nrow(tree$edge),
  alpha,alpha)
hist(rates,breaks=30,main="Rate variation among edges",
  las=1,cex.axis=0.7)

plot of chunk unnamed-chunk-5

The final step is simulating our character. To do that, I’ll just take my rates from the previous step and then multiply them by the edges of my simulated pure-birth tree to create a new phylogeny for simulation (simtree, below). I can then simulate on this modified tree using the standard Mk model simulator of phytools, sim.Mk.

For good measure, I’ll visualize the simulating tree with the generating rates overlain via the viridis color palette, just to help the reader get an appreciation of the kind of rate heterogeneity we’re talking about.

simtree<-tree
simtree$edge.length<-simtree$edge.length*rates
x<-sim.Mk(simtree,Q)
head(x)
## t358 t660 t787 t788 t659 t210 
##    1    2    2    2    2    2 
## Levels: 0 1 2
foo<-function(...) viridisLite::viridis(...,
  option="A",direction=-1)
plotBranchbyTrait(simtree,rates,mode="edges",
  show.tip.label=FALSE,edge.width=1,
  palette=foo,title="relative rate",
  legend=max(nodeHeights(simtree))/2)

plot of chunk unnamed-chunk-7

Now we’re ready to fit our models.

To begin, I’ll just fit a constant rates model. This can be done using fitMk in the typical ways. Since my generating model from before was symmetric, I’ll set model="SYM" here.

fit1<-fitMk(tree,x,rand_start=TRUE,model="SYM")
fit1
## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##           0         1         2
## 0 -0.722220  0.398998  0.323222
## 1  0.398998 -0.934550  0.535552
## 2  0.323222  0.535552 -0.858774
## 
## Fitted (or set) value of pi:
##        0        1        2 
## 0.333333 0.333333 0.333333 
## due to treating the root prior as (a) flat.
## 
## Log-likelihood: -943.033743 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

Next, we’ll fit our rate heterogeneous model.

Just as for molecular evolution, the \(\Gamma\) model we use is a discretized \(\Gamma\), so we need to indicate the number of rate categories (nrates) – which defaults (if we don’t set it) to nrates=4.

Unfortunately, this can sometimes be a tricky model to optimize (or perhaps I just haven’t figured out the details yet), so I’m going to run multiple optimization iterations. I’ll parallelize these using the foreach package so that they don’t take all day to run.

niter<-10
ncores<-min(c(parallel::detectCores()-2,niter))
mc<-parallel::makeCluster(ncores,type="PSOCK")
doParallel::registerDoParallel(cl=mc)
library(foreach)
fits<-foreach(i=1:niter)%dopar%{
  phytools::fitgammaMk(tree,x,rand_start=TRUE,model="SYM")
}
parallel::stopCluster(cl=mc)
logL<-sapply(fits,logLik)
logL
##  [1] -938.5193 -938.5193 -975.5310 -971.7774 -938.5193 -971.8135 -971.7774 -971.8136 -971.7774 -938.5193
fit4<-fits[[which(logL==max(logL))[1]]]
fit4
## Object of class "fitgammaMk".
## 
## Fitted (or set) value of Q:
##           0         1         2
## 0 -1.200290  0.722021  0.478269
## 1  0.722021 -1.676490  0.954469
## 2  0.478269  0.954469 -1.432738
## 
## Fitted (or set) value of alpha rate heterogeneity
## (with 4 rate categories): 0.359086
## 
## Fitted (or set) value of pi:
##        0        1        2 
## 0.333333 0.333333 0.333333 
## due to treating the root prior as (a) flat.
## 
## Log-likelihood: -938.519292 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

Models with different numbers of rate categories do not consume more parameters in this model – just the computational effort required to calculate the likelihood; however, we can try nrates=10 to see if and how this affects our result. Once again, I’m going to run 10 optimization iterations to be on the safe side.

niter<-10
ncores<-min(c(parallel::detectCores()-2,niter))
mc<-parallel::makeCluster(ncores,type="PSOCK")
doParallel::registerDoParallel(cl=mc)
library(foreach)
fits<-foreach(i=1:niter)%dopar%{
  phytools::fitgammaMk(tree,x,rand_start=TRUE,model="SYM",
    nrates=10)
}
parallel::stopCluster(cl=mc)
logL<-sapply(fits,logLik)
logL
##  [1] -938.3695 -938.3695 -938.3695 -938.3695 -938.3695 -938.3695 -938.3695 -938.3695 -938.3695 -938.3695
fit10<-fits[[which(logL==max(logL))[1]]]
fit10
## Object of class "fitgammaMk".
## 
## Fitted (or set) value of Q:
##           0         1         2
## 0 -1.286695  0.775940  0.510755
## 1  0.775940 -1.809425  1.033485
## 2  0.510755  1.033485 -1.544240
## 
## Fitted (or set) value of alpha rate heterogeneity
## (with 10 rate categories): 0.375444
## 
## Fitted (or set) value of pi:
##        0        1        2 
## 0.333333 0.333333 0.333333 
## due to treating the root prior as (a) flat.
## 
## Log-likelihood: -938.369467 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

In general, we can see that we both estimated \(\alpha\) pretty well (remember, the generating value was \(\alpha\) = 0.3), and we get much closer to the true underlying Q matrix when we take into account branch rate heterogeneity – even better for nrates = 10 than nrates = 4.

Lastly, for completeness, let’s try fitting a totally different type of rate heterogeneous model: the hidden-rates model.

Under the hidden-rates model, different rates of trait evolution are captured by imagining that there are a set of hidden character conditions, each with different transition rates to other states, that underlie our set of observed traits. This time multiple (parallelized) optimization iterations are run by default!

fithrm<-fitHRM(tree,x,model="SYM",
  parallel=TRUE)
## 
## This is the design matrix of the fitted model.
## Does it make sense?
## 
##    0 0* 1 1* 2 2*
## 0  0  1 2  0 5  0
## 0* 1  0 0  3 0  7
## 1  2  0 0  4 6  0
## 1* 0  3 4  0 0  8
## 2  5  0 6  0 0  9
## 2* 0  7 0  8 9  0
## 
## Opened cluster with 10 cores.
## Running optimization iterations in parallel.
## Please wait....
fithrm
## Object of class "fitHRM".
## 
## Observed states: [ 0, 1, 2 ]
## Number of rate categories per state: [ 2, 2, 2 ]
## 
## Fitted (or set) value of Q:
##            0       0*         1        1*         2        2*
## 0  -6.175671  3.82394  1.624893  0.000000  0.726838  0.000000
## 0*  3.823940 -3.82394  0.000000  0.000000  0.000000  0.000000
## 1   1.624893  0.00000 -3.130241  0.366517  1.138830  0.000000
## 1*  0.000000  0.00000  0.366517 -0.682846  0.000000  0.316329
## 2   0.726838  0.00000  1.138830  0.000000 -2.736238  0.870569
## 2*  0.000000  0.00000  0.000000  0.316329  0.870569 -1.186898
## 
## Fitted (or set) value of pi:
##        0       0*        1       1*        2       2* 
## 0.166667 0.166667 0.166667 0.166667 0.166667 0.166667 
## due to treating the root prior as (a) flat.
## 
## Log-likelihood: -935.141791 
## 
## Optimization method used was "optim"
## 
## R thinks it has found the ML solution.

Comparing among this set of four models is as easy as running an anova function call on the set – although, remember, fit4 and fit10 are not distinctly different models, just different simplifications of the same underlying conceptual model!

anova(fit1,fit4,fit10,fithrm)
##           log(L) d.f.      AIC     weight
## fit1   -943.0337    3 1892.067 0.01245987
## fit4   -938.5193    4 1885.039 0.41862024
## fit10  -938.3695    4 1884.739 0.48628222
## fithrm -935.1418    9 1888.284 0.08263766

Our two different \(\Gamma\) models capture the overwhelming portion of support here -- and this should be unsurprising, because that was the generating model!

Cool. That’s it! Hopefully the next application will be ancestral state reconstruction or estimating the branch wise rates.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.