Tuesday, January 2, 2024

Mk vs. Γ vs. hidden-rates model for regime shifts on the tree in the evolution of a discrete phenotypic trait

Recently, I’ve been blogging about a new \(\Gamma\) rate heterogeneity model for discrete phenotypic characters in the phytools package (e.g., 1, 2, 3, 4, 5). According to this model, edge rates for a discrete phenotypic trait vary according a discretized \(\Gamma\) distribution with shape parameter \(\alpha\). \(\beta = \alpha\) so the \(\Gamma\) mean is 1.0, and low values of \(\alpha\) correspond to high rate heterogeneity among branches (& vice versa).

Something I thought might be interesting to explore would be to see how the model behaves when (instead of varying according a \(\Gamma\) distribution) there are one or a small number of discrete shifts in the rate of evolution on certain pre-defined branches of the tree.

Here, I’m going to simulate exactly that type of rate variation for an ordered, 3-state character with two rate shifts. I’ll do it using phytools functions pbtree (to simulate the tree), paintSubTree (to paint our rate regimes), and sim.multiMk (to simulate our rates).

## load packages
library(phytools)
## set seed & simulate tree
set.seed(99)
tree<-pbtree(n=200,scale=1)
## paint regimes
sim_tree<-paintSubTree(tree,node=316,state="1",
  anc.state="0",stem=0.5)
sim_tree<-paintSubTree(sim_tree,node=228,
  state="1",stem=0.5)
## simulate trait with two Q-matrices
Q<-matrix(c(
  -1,1,0,
  1,-2,1,
  0,1,-1),3,3,
  dimnames=list(letters[1:3],letters[1:3]))
simQ<-setNames(list(0.5*Q,20*Q),0:1)
x<-sim.multiMk(sim_tree,simQ)

Let’s visualize this on our tree. I’m going to use a hack that you might learn by reading Chapter 13 of my recent book with Luke Harmon to map our discrete character at the tips of the tree.

cols<-setNames(c("blue","red"),0:1)
plot(sim_tree,cols,ftype="off",lwd=1,
  direction="upwards")
pp<-get("last_plot.phylo",envir=.PlotPhyloEnv)
x_cols<-setNames(viridisLite::viridis(n=3),
  levels(x))[x[tree$tip.label]]
segments(x0=pp$xx[1:Ntip(tree)],x1=pp$xx[1:Ntip(tree)],
  y0=pp$yy[1:Ntip(tree)], y1=pp$yy[1:Ntip(tree)]+0.1,lwd=3,
  col=x_cols)
legend("bottomleft",letters[1:3],lwd=3,
  col=viridisLite::viridis(n=3),bty="n")

plot of chunk unnamed-chunk-7

To start, let’s fit a standard (homogeneous rate) Mk model to these data. I’m going to use the structure of my generating Q matrices, as follows.

MODEL<-Q
diag(MODEL)<-0
MODEL
##   a b c
## a 0 1 0
## b 1 0 1
## c 0 1 0
fit_mk<-fitMk(tree,x,model=MODEL,pi="fitzjohn",
  rand_start=TRUE)
fit_mk
## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##           a         b         c
## a -1.704026  1.704026  0.000000
## b  1.704026 -3.408051  1.704026
## c  0.000000  1.704026 -1.704026
## 
## Fitted (or set) value of pi:
##        a        b        c 
## 0.230061 0.504219 0.265720 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -192.777337 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

Cool. OK, now we can compare this to our new \(\Gamma\) model. Remember, our generating process is not \(\Gamma\)-distributed rates across edges! We’re just curious to see if the real process (discrete regime shifts) is captured at all by fitgammaMk.

fit_gamma<-fitgammaMk(tree,x,model=MODEL,nrates=8,
  pi="fitzjohn",marginal=TRUE,rand_start=TRUE)
##   --
##   Computing marginal scaled likelihoods (in parallel) of each
##   rate on each edge. Caution: this is NOT fast....
##   --
fit_gamma
## Object of class "fitgammaMk".
## 
## Fitted (or set) value of Q:
##          a        b        c
## a -11.4076  11.4076   0.0000
## b  11.4076 -22.8152  11.4076
## c   0.0000  11.4076 -11.4076
## 
## Fitted (or set) value of alpha rate heterogeneity
## (with 8 rate categories): 0.102942
## 
## Fitted (or set) value of pi:
##        a        b        c 
## 0.303173 0.463638 0.233189 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -181.833211 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

Clearly, this model fit much better than a homogenous rate Mk model, but let’s see if the marginal edge rates reflect what we know to be the true rate heterogeneity of our data.

par(mfrow=c(1,2))
plot(sim_tree,cols,ftype="off",lwd=1)
plot(fit_gamma,ftype="off")

plot of chunk unnamed-chunk-11

Awesome. That’s really not bad.

Of course, if we have a hypothesis of discrete regime shifts, but we don’t know where these might have occurred, a more appropriate model to fit would be the hidden-rates model of Beaulieu et al. (2013). In phytools, this can be doing using fitHRM. Unfortunately, fitHRM doesn’t allow us to specify a custom (e.g., ordered) transition model. Fortunately, it’s pretty easy to pull the design matrix out of a fitted model object & then update it to match our hypothesized Q matrix structure. Let’s do it.

## fit equal-rates HRM model (just to get object)
fit_hrm<-fitHRM(tree,x,model="ER",ncat=2,pi="fitzjohn",
  corHMM_model=TRUE,niter=1)
## 
## This is the design matrix of the fitted model.
## Does it make sense?
## 
##    a a* b b* c c*
## a  0  3 1  0 1  0
## a* 3  0 0  2 0  2
## b  1  0 0  3 1  0
## b* 0  2 3  0 0  2
## c  1  0 1  0 0  3
## c* 0  2 0  2 3  0
## 
## log-likelihood from current iteration: -160.9726 
##  --- Best log-likelihood so far: -160.9726 ---
## pull out model structure
hrm_MODEL<-fit_hrm$index.matrix
hrm_MODEL
##    a a* b b* c c*
## a  0  3 1  0 1  0
## a* 3  0 0  2 0  2
## b  1  0 0  3 1  0
## b* 0  2 3  0 0  2
## c  1  0 1  0 0  3
## c* 0  2 0  2 3  0
## update model structure to be ordered a<->b<->c
## for each rate level
hrm_MODEL[1,5]<-hrm_MODEL[2,6]<-hrm_MODEL[5,1]<-
  hrm_MODEL[6,2]<-0
hrm_MODEL
##    a a* b b* c c*
## a  0  3 1  0 0  0
## a* 3  0 0  2 0  0
## b  1  0 0  3 1  0
## b* 0  2 3  0 0  2
## c  0  0 1  0 0  3
## c* 0  0 0  2 3  0
## fit hidden-rates model using fitMk
fit_hrm<-fitMk(tree,fit_hrm$data,model=hrm_MODEL,pi="fitzjohn",
  rand_start=TRUE)
fit_hrm
## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##            a        a*         b        b*         c        c*
## a  -24.13886  0.109710  24.02915  0.000000   0.00000  0.000000
## a*   0.10971 -0.437854   0.00000  0.328144   0.00000  0.000000
## b   24.02915  0.000000 -48.16801  0.109710  24.02915  0.000000
## b*   0.00000  0.328144   0.10971 -0.765997   0.00000  0.328144
## c    0.00000  0.000000  24.02915  0.000000 -24.13886  0.109710
## c*   0.00000  0.000000   0.00000  0.328144   0.10971 -0.437854
## 
## Fitted (or set) value of pi:
##        a       a*        b       b*        c       c* 
## 0.012770 0.161929 0.010274 0.735779 0.008636 0.070611 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -156.858431 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

OK, obviously this model fits way better – just as one might expect it would. But how do we map our rate regimes back onto the edges of the tree? The easiest way to do this (IMO) is to perform stochastic mapping on the fitted model, merge by rate regime level, and then visual our merged maps using phytools::densityMap.

Let’s try it.

## perform stochastic mapping
hrm_smap<-simmap(fit_hrm)
hrm_smap
## 100 phylogenetic trees with mapped discrete characters
hrm_smap<-mergeMappedStates(hrm_smap,c("a*","b*","c*"),
  "cold")
hrm_smap<-mergeMappedStates(hrm_smap,c("a","b","c"),
  "warm")
## computing "densityMap" object
dmap<-densityMap(hrm_smap,plot=FALSE)
## sorry - this might take a while; please be patient
## finally, let's create our plot
par(mfrow=c(1,2))
plot(sim_tree,cols,ftype="off",lwd=1)
Nt<-Ntip(tree)
h<-max(nodeHeights(tree))
plot(dmap,ftype="off",lwd=2,legend=FALSE,
  xlim=c(-0.3,1))
h <- max(nodeHeights(tree))
LWD <- diff(par()$usr[1:2])/dev.size("px")[1]
Nt <- Ntip(tree)
lines(x = rep(-0.25 * h + LWD * 15/2, 2), y = c(1 + 1/40 * 
    Nt, Nt - 1/40 * Nt))
nticks <- 11
Y <- cbind(seq(1 + 1/40 * Nt, Nt - 1/40 * Nt, length.out = nticks), 
  seq(1 + 1/40 * Nt, Nt - 1/40 * Nt, length.out = nticks))
X <- cbind(rep(-0.25 * h + LWD * 15/2, nticks), rep(-0.25 * 
    h + LWD * 15/2 + 0.02 * h, nticks))
for (i in 1:nrow(Y)) lines(X[i, ], Y[i, ])
add.color.bar(Nt-2/40*Nt-1,dmap$cols,
  x=-0.25*h,y=1+1/40*Nt,
  title="probability of being in \"warm\" state",
  lims=NULL,digits=3,direction="upwards",
  subtitle="",lwd=15,prompt=FALSE)
ticks<-seq(0,1.0,length.out=11)
text(x = X[, 2], y = Y[, 2], signif(ticks, 2), pos = 4, 
  cex = 0.7)

plot of chunk unnamed-chunk-19

Wow. The distribution of “warm” rates match our simulation conditions with near exact precision. Cool!

That’s all for now.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.