Monday, December 11, 2023

Marginal likelihoods & plotting method for Γ-distributed rate heterogeneity model

Those that follow this blog will know that I’ve been posting (1, 2, 3) about a new phytools method (fitgammaMk) to fit a model of discrete character evolution in which edge rates come from a \(\Gamma\) distribution.

The way I did this was, in some ways, analogous to the discrete \(\Gamma\) model for rate variation among sites that’s used so commonly in molecular phylogenetic inference.

My most recent additions to this work come in the form of an option to export marginal scaled likelihoods of each edge being in each rate to the user; and a simple plot method for the object class.

Let me say a tiny bit about the former, which is the more interesting part.

The way I obtained marginal likelihoods for each edge rate was by going through the tree, branch by branch, fixing the rate for that branch to (in turn) each of the different discrete \(\Gamma\) distributed rates, computing the likelihood (integrating over all rates for all other branches of the tree) – and then simply repeating this for all the branches of the tree! (I then rescaled these marginal likelihoods so that they summed to 1.0, just as we do for ancestral state reconstruction.) This algorithm definitely gives me the right scaled likelihoods (this is very easy to show – maybe I’ll do that in a different post); however, it’s quite computationally expensive since I repeat pruning n \(\times\) k times, for n edges and k rate categories. I feel like I ought to be able to compute the same set of probabilities (as we do for marginal ancestral state reconstruction) via a second pre-order (root-to-tip) tree traversal. Unfortunately, I haven’t figured this out yet! For the time being, I just need to be patient.

Having computed the probabilities (marginal scaled likelihoods) that each edge is in each branch, getting a branch-rate merely involves computing the sum product of the probabilities and their rates.

OK. Let’s demo the workflow.

To give ourselves the absolutely best chance of successfully measuring the branch-rates of our character, I’m going to simulate an ordered multi-state character with many states but a single rate of change between states. (This is simply so that our data provide us with as much information as possible about evolutionary change on the tree. For more realistic discrete character data, we’d expect our performance to dip considerably!)

library(phytools)
packageVersion("phytools")
## [1] '2.0.11'
nlevs<-12
Model<-matrix(0,nlevs,nlevs,
  dimnames=list(letters[1:nlevs],letters[1:nlevs]))
for(i in 2:nlevs) Model[i-1,i]<-Model[i,i-1]<-1
Q<-Model
diag(Q)<--rowSums(Model)
Q
##    a  b  c  d  e  f  g  h  i  j  k  l
## a -1  1  0  0  0  0  0  0  0  0  0  0
## b  1 -2  1  0  0  0  0  0  0  0  0  0
## c  0  1 -2  1  0  0  0  0  0  0  0  0
## d  0  0  1 -2  1  0  0  0  0  0  0  0
## e  0  0  0  1 -2  1  0  0  0  0  0  0
## f  0  0  0  0  1 -2  1  0  0  0  0  0
## g  0  0  0  0  0  1 -2  1  0  0  0  0
## h  0  0  0  0  0  0  1 -2  1  0  0  0
## i  0  0  0  0  0  0  0  1 -2  1  0  0
## j  0  0  0  0  0  0  0  0  1 -2  1  0
## k  0  0  0  0  0  0  0  0  0  1 -2  1
## l  0  0  0  0  0  0  0  0  0  0  1 -1
tree<-pbtree(n=100)
alpha<-0.2
rates<-rgamma(n=nrow(tree$edge),
   alpha,alpha)
simtree<-tree
simtree$edge.length<-simtree$edge.length*rates
x<-sim.Mk(simtree,Q)
x<-to.matrix(x,letters[1:nlevs])
head(x)
##     a b c d e f g h i j k l
## t14 0 0 0 0 0 1 0 0 0 0 0 0
## t15 0 0 1 0 0 0 0 0 0 0 0 0
## t83 0 0 1 0 0 0 0 0 0 0 0 0
## t84 0 0 1 0 0 0 0 0 0 0 0 0
## t13 1 0 0 0 0 0 0 0 0 0 0 0
## t12 0 0 0 0 1 0 0 0 0 0 0 0

To get the marginal likelihoods we need to include the argument marginal=TRUE. Since this may take a while, the function prints us a handy warning message! In addition to marginal=TRUE, I’m also setting our transition model (to the ordered, one rate model that we used for simulation), and the number of rate categories for my discretized \(\Gamma\) distribution to nrates=10. By default, fitgammaMk will use parallelization via the foreach package.

fitgamma<-fitgammaMk(tree,x,rand_start=TRUE,min.alpha=0.01,
  model=Model,nrates=10,pi="fitzjohn",marginal=TRUE)
##   --
##   Computing marginal scaled likelihoods (in parallel) of each
##   rate on each edge. Caution: this is NOT fast....
##   --
fitgamma
## Object of class "fitgammaMk".
## 
## Fitted (or set) value of Q:
##           a         b         c         d         e         f         g         h
## a -1.022294  1.022294  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## b  1.022294 -2.044588  1.022294  0.000000  0.000000  0.000000  0.000000  0.000000
## c  0.000000  1.022294 -2.044588  1.022294  0.000000  0.000000  0.000000  0.000000
## d  0.000000  0.000000  1.022294 -2.044588  1.022294  0.000000  0.000000  0.000000
## e  0.000000  0.000000  0.000000  1.022294 -2.044588  1.022294  0.000000  0.000000
## f  0.000000  0.000000  0.000000  0.000000  1.022294 -2.044588  1.022294  0.000000
## g  0.000000  0.000000  0.000000  0.000000  0.000000  1.022294 -2.044588  1.022294
## h  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  1.022294 -2.044588
## i  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  1.022294
## j  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## k  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## l  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
##           i         j         k         l
## a  0.000000  0.000000  0.000000  0.000000
## b  0.000000  0.000000  0.000000  0.000000
## c  0.000000  0.000000  0.000000  0.000000
## d  0.000000  0.000000  0.000000  0.000000
## e  0.000000  0.000000  0.000000  0.000000
## f  0.000000  0.000000  0.000000  0.000000
## g  0.000000  0.000000  0.000000  0.000000
## h  1.022294  0.000000  0.000000  0.000000
## i -2.044588  1.022294  0.000000  0.000000
## j  1.022294 -2.044588  1.022294  0.000000
## k  0.000000  1.022294 -2.044588  1.022294
## l  0.000000  0.000000  1.022294 -1.022294
## 
## Fitted (or set) value of alpha rate heterogeneity
## (with 10 rate categories): 0.098424
## 
## Fitted (or set) value of pi:
##        a        b        c        d        e        f        g        h        i 
## 0.215313 0.447150 0.323717 0.011271 0.001816 0.000503 0.000159 0.000051 0.000016 
##        j        k        l 
## 0.000004 0.000001 0.000000 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -111.829667 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

Now let’s plot our fitted edge rates.

Rates<-plot(fitgamma)

plot of chunk unnamed-chunk-46

(Remember, these are relative rates: from a \(\Gamma\) distribution that’s intentionally centered on 1.0.)

Our plot method invisibly returns the rates that it projects to the user, so (if we must) we can go back & compare these to the generating values.

plot(log(rates),log(Rates),bty="n",las=1,cex.axis=0.9,
  xlab="log(simulated)",ylab="log(estimated)",
  pch=16,col=make.transparent("blue",0.5),
  mar=c(5.1,4.1,1.1,1.1))
abline(lm(log(Rates)~log(rates)),lwd=2)

plot of chunk unnamed-chunk-47

By some sort of miracle, they’re correlated.

That’s it!

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.