Sunday, September 20, 2020

A hidden-rate discrete character evolution model in phytools

Over the past several months Luke Harmon and I have been working on a project that has led me to make lots of different updates and additions to phytools.

This afternoon, I submitted a new phytools version to CRAN and it is already available (only as source; Windows & Mac binaries usually take a few days to be built).

I haven't had time in recent weeks to post about the new package updates that are in this current version. I'm going to try to make a few blog entries now, however, that describe some of the updates and new features of phytools.

The first of these that I'll mention is a new function (called fitHRM) that can be used to fit the hidden-rate-model of Beaulieu et al. (2013).

This model can already be fit to discrete character data in R using the package corHMM. In fact, I suspect that the implementation of corHMM is more robust and scales better to large trees. (I have not yet tried to use fitHRM with very large phylogeny - but I know that corHMM in the corHMM package has been used in empirical studies involving up to 1000s of taxa.)

The idea of this model is that we observe evolution between the states in a certain state space (e.g., abc, etc.), but there also exists unobserved conditions for the each state (or some states, see below) with different rates of change to other states (e.g., a', b', c', and so on).

A simple example of this might be the case of a binary trait in which condition 0 (say) existed in two hidden states: 0' (hot) and 0'' (cold). When in observed condition 0 has the hidden state 0' the change 0 ↔ 1 is allowed, but when it's in state 0'' the change 0 ↔ 1 cannot occur.

To start, let's simulate under this exactly model & see what evolution looks like!

packageVersion("phytools")
## [1] '0.7.70'
library(phytools)
tree<-pbtree(n=200,scale=1)
Q<-matrix(c(
    -0.5,0.5,0.0,
    0.5,-4.5,4.0,
    0.0,4.0,-4.0),3,3,byrow=TRUE,
    dimnames=list(c("0**","0*","1"),
    c("0**","0*","1")))
x<-sim.Mk(tree,Q,anc="0**")
plotTree(tree,type="fan",lwd=1,ftype="off")
pp<-get("last_plot.phylo",envir=.PlotPhyloEnv)
cols<-setNames(c("grey","grey","black"),
    levels(x))
pch<-setNames(c(21,23,21),levels(x))
cex<-setNames(c(1.3,1.1,1.3),levels(x))
points(pp$xx[1:Ntip(tree)],pp$yy[1:Ntip(tree)],
    pch=pch[x],bg=cols[x],cex=cex[x])
legend("topleft",c("0","1"),pch=22,pt.cex=1.5,
    pt.bg=cols[c("0*","1")],
    title="Observed state")
legend("topright",levels(x),pch=pch,
    pt.cex=c(1.7,1.5,1.5),
    pt.bg=cols,title="Hidden state")

plot of chunk unnamed-chunk-2

The idea is we only “see” two values for our trait at each tip in the tree: 0 or 1. Two different conditions for state 0 are unobserved: 0* with a high rate of evolution to 1; and 0** that can't change to 1 (except by way of changing first to state 0*).

The effect of this hidden-rate heterogeneity on the trait distribution of 0 & 1 on our tree is extremely evident.

For instance, in one part of the tree we see a large, deeply divergent clade that is entirely in state 0 (for hidden state 0**). In other regions of the phylogeny 0 & 1 change back and forth easily (for hidden state 0*).

Let's merge our 0* & 0** into a single character (0) to see if we can fit a hidden-rates model to our data using the new phytools function fitHRM.

y<-x
levels(y)<-c(0,0,1)
head(y,100)
##  t17 t183 t184  t37  t38  t25 t151 t152  t98  t93  t94  t13  t54  t55  t39  t83 t190 t191 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
## t137 t138  t15  t16 t120 t121  t28  t29 t148 t149 t111  t69  t34  t35 t119 t150 t167 t168 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
## t169 t170 t186 t187  t73  t74  t31  t56  t57 t134 t135 t112 t117 t118   t8   t9  t12  t67 
##    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0 
##  t68  t32 t178 t179   t3  t81  t82  t70  t79  t80  t45  t75  t76  t49  t20 t124 t171 t172 
##    0    0    0    0    1    0    1    1    1    1    1    0    1    1    1    1    0    1 
##  t51  t52   t2  t36 t104 t105  t86  t19  t71  t72  t64  t63 t115 t116 t175 t176 t159 t160 
##    0    1    1    1    0    1    1    0    0    1    1    0    1    0    1    1    0    0 
## t113 t128 t129  t23 t195 t196 t177 t173 t174  t41 
##    0    1    1    1    1    1    0    0    0    0 
## Levels: 0 1

(y is the trait vector that we observe, remember. We have no idea whether each tip of the tree is in hidden-state 0* or 0**.)

Now let's fit our model. For our hidden rates model we imagine that there are two rate categories (i.e., two hidden states) for 0, but only one rate category (no hidden states) for 1. We specify this model design using the argument ncat=c(2,1). Note that this model explicitly prohibits transitions from 0** (actually, we're going to call them 0 and 0* in the fitted model) to 1.

hrm<-fitHRM(tree,y,ncat=c(2,1),pi="fitzjohn")
## 
## This is the design matrix of the fitted model.
## Does it make sense?
## 
##    0 0* 1
## 0  0  1 2
## 0* 3  0 0
## 1  4  0 0
## 
## log-likelihood from current iteration: -85.3678989795076 
##  --- Best log-likelihood so far: -85.3678989795076 ---
## log-likelihood from current iteration: -85.3678989789946 
##  --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3679184420598 
##  --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3678989790264 
##  --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3678989790468 
##  --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3678989856971 
##  --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3678989861835 
##  --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3678989883086 
##  --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3678992525849 
##  --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.367898979139 
##  --- Best log-likelihood so far: -85.3678989789946 ---
hrm
## Object of class "fitHRM".
## 
## Observed states: [ 0, 1 ]
## Number of rate categories per state: [ 2, 1 ]
## 
## Fitted (or set) value of Q:
##            0       0*         1
## 0  -6.706022 0.580216  6.125806
## 0*  0.000000 0.000000  0.000000
## 1   3.575458 0.000000 -3.575458
## 
## Fitted (or set) value of pi:
##        0       0*        1 
## 0.707789 0.000000 0.292211 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -85.367899 
## 
## Optimization method used was "nlminb"

In addition to the hidden rates model, we can also fit a standard Mk model. The hidden rates model has this model as a special case, so we can compare their likelihoods easily.

ard<-fitMk(tree,y,model="ARD",pi="fitzjohn")
ard
## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##           0         1
## 0 -0.245458  0.245458
## 1  1.882550 -1.882550
## 
## Fitted (or set) value of pi:
##        0        1 
## 0.145707 0.854293 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -99.131646 
## 
## Optimization method used was "nlminb"

In addition to the verbal hidden-rates model that we've described, let's also fit a second hidden rates model in which instead of two rate categories for 0 and two for 1 we have two rate categories (that is, two hidden states) for each observed level of our trait. (This is actually the same model that's fit by the corHMM.)

hrm2<-fitHRM(tree,y,ncat=2,pi="fitzjohn")
## 
## This is the design matrix of the fitted model.
## Does it make sense?
## 
##    0 0* 1 1*
## 0  0  1 2  0
## 0* 3  0 0  4
## 1  5  0 0  6
## 1* 0  7 8  0
## 
## log-likelihood from current iteration: -83.0626425938858 
##  --- Best log-likelihood so far: -83.0626425938858 ---
## log-likelihood from current iteration: -85.3679000610412 
##  --- Best log-likelihood so far: -83.0626425938858 ---
## log-likelihood from current iteration: -83.0626429300241 
##  --- Best log-likelihood so far: -83.0626425938858 ---
## log-likelihood from current iteration: -83.0626392569354 
##  --- Best log-likelihood so far: -83.0626392569354 ---
## log-likelihood from current iteration: -85.5141689438441 
##  --- Best log-likelihood so far: -83.0626392569354 ---
## log-likelihood from current iteration: -85.4894151270034 
##  --- Best log-likelihood so far: -83.0626392569354 ---
## log-likelihood from current iteration: -85.4899853283194 
##  --- Best log-likelihood so far: -83.0626392569354 ---
## log-likelihood from current iteration: -83.0626412601082 
##  --- Best log-likelihood so far: -83.0626392569354 ---
## log-likelihood from current iteration: -85.5159297400412 
##  --- Best log-likelihood so far: -83.0626392569354 ---
## log-likelihood from current iteration: -83.062694408704 
##  --- Best log-likelihood so far: -83.0626392569354 ---
hrm
## Object of class "fitHRM".
## 
## Observed states: [ 0, 1 ]
## Number of rate categories per state: [ 2, 1 ]
## 
## Fitted (or set) value of Q:
##            0       0*         1
## 0  -6.706022 0.580216  6.125806
## 0*  0.000000 0.000000  0.000000
## 1   3.575458 0.000000 -3.575458
## 
## Fitted (or set) value of pi:
##        0       0*        1 
## 0.707789 0.000000 0.292211 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -85.367899 
## 
## Optimization method used was "nlminb"

This model (by default - we can also set it to be ordered=TRUE to prevent this) allows the transitions 0 $harr; 1 and 0*1*, as well as 00* and 11*.

Here's a plot of our four different models so you have a better idea.

layout(matrix(c(1,2,3,3),2,2,,byrow=TRUE))
plot(ard)
mtext("a) ARD Mk model",line=1,adj=0,font=3)
plot(hrm)
mtext("b) HRM model with 1 hidden state",line=1,
    adj=0,font=3)
plot(hrm2)
mtext("c) HRM model with 2 hidden states",line=1,
    adj=0,font=3)

plot of chunk unnamed-chunk-7

Let's put all our models into a table and compare them.

data.frame(model=c("ARD","HRM-1","HRM-2"),
    logL=sapply(list(ard,hrm,hrm2),logLik),
    AIC(ard,hrm,hrm2))
##      model      logL df      AIC
## ard    ARD -99.13165  2 202.2633
## hrm  HRM-1 -85.36790  4 178.7358
## hrm2 HRM-2 -83.06264  8 182.1253

This shows us that the best-supported model is the HRM-1, one hidden-state model. This makes perfect sense because it was the model we used for simulation. Cool!

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.