Over the past several months Luke Harmon and I have been working on a project
that has led me to make lots of different updates and additions to *phytools*.

This afternoon, I submitted a new *phytools* version to CRAN and
it is already available (only as source; Windows & Mac binaries usually take a few days to be built).

I haven't had time in recent weeks to post about the new package updates that are in this current version. I'm going
to try to make a few blog entries now, however, that describe some of the updates and new features of *phytools*.

The first of these that I'll mention is a new function (called
`fitHRM`

) that can be used to fit the
hidden-rate-model of Beaulieu et al. (2013).

This model can already be fit to discrete character data in R using the package
*corHMM*. In fact, I suspect that the implementation of *corHMM* is more
robust and scales better to large trees. (I have not yet tried to use `fitHRM`

with very large phylogeny - but I know
that `corHMM`

in the *corHMM* package has been used in empirical studies involving up to 1000s of taxa.)

The idea of this model is that we observe evolution between the states in a certain state space (e.g., *a* ↔
*b* ↔ *c*, etc.), but there also exists unobserved conditions for the each state (or some states, see below)
with different rates of change to other states (e.g., *a'*, *b'*, *c'*, and so on).

A simple example of this might be the case of a binary trait in which condition 0 (say) existed in two hidden states:
0' (hot) and 0'' (cold). When in observed condition 0 has the hidden state 0' the change 0 ↔ 1 is allowed, but
when it's in state 0'' the change 0 ↔ 1 *cannot* occur.

To start, let's simulate under this exactly model & see what evolution looks like!

```
packageVersion("phytools")
```

```
## [1] '0.7.70'
```

```
library(phytools)
tree<-pbtree(n=200,scale=1)
Q<-matrix(c(
-0.5,0.5,0.0,
0.5,-4.5,4.0,
0.0,4.0,-4.0),3,3,byrow=TRUE,
dimnames=list(c("0**","0*","1"),
c("0**","0*","1")))
x<-sim.Mk(tree,Q,anc="0**")
plotTree(tree,type="fan",lwd=1,ftype="off")
pp<-get("last_plot.phylo",envir=.PlotPhyloEnv)
cols<-setNames(c("grey","grey","black"),
levels(x))
pch<-setNames(c(21,23,21),levels(x))
cex<-setNames(c(1.3,1.1,1.3),levels(x))
points(pp$xx[1:Ntip(tree)],pp$yy[1:Ntip(tree)],
pch=pch[x],bg=cols[x],cex=cex[x])
legend("topleft",c("0","1"),pch=22,pt.cex=1.5,
pt.bg=cols[c("0*","1")],
title="Observed state")
legend("topright",levels(x),pch=pch,
pt.cex=c(1.7,1.5,1.5),
pt.bg=cols,title="Hidden state")
```

The idea is we only “see” two values for our trait at each tip in the tree: 0 or 1. Two different conditions for state 0 are unobserved: 0* with a high rate of evolution to 1; and 0** that can't change to 1 (except by way of changing first to state 0*).

The effect of this hidden-rate heterogeneity on the trait distribution of 0 & 1 on our tree is extremely evident.

For instance, in one part of the tree we see a large, deeply divergent clade that is entirely in state 0 (for hidden state 0**). In other regions of the phylogeny 0 & 1 change back and forth easily (for hidden state 0*).

Let's merge our 0* & 0** into a single character (0) to see if we can fit a hidden-rates model to our data using the
new *phytools* function `fitHRM`

.

```
y<-x
levels(y)<-c(0,0,1)
head(y,100)
```

```
## t17 t183 t184 t37 t38 t25 t151 t152 t98 t93 t94 t13 t54 t55 t39 t83 t190 t191
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## t137 t138 t15 t16 t120 t121 t28 t29 t148 t149 t111 t69 t34 t35 t119 t150 t167 t168
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## t169 t170 t186 t187 t73 t74 t31 t56 t57 t134 t135 t112 t117 t118 t8 t9 t12 t67
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## t68 t32 t178 t179 t3 t81 t82 t70 t79 t80 t45 t75 t76 t49 t20 t124 t171 t172
## 0 0 0 0 1 0 1 1 1 1 1 0 1 1 1 1 0 1
## t51 t52 t2 t36 t104 t105 t86 t19 t71 t72 t64 t63 t115 t116 t175 t176 t159 t160
## 0 1 1 1 0 1 1 0 0 1 1 0 1 0 1 1 0 0
## t113 t128 t129 t23 t195 t196 t177 t173 t174 t41
## 0 1 1 1 1 1 0 0 0 0
## Levels: 0 1
```

(`y`

is the trait vector that we observe, remember. We have no idea whether each tip of the tree is in hidden-state
0* or 0**.)

Now let's fit our model. For our hidden rates model we imagine that there are two rate categories (i.e., two hidden
states) for 0, but only one rate category (no hidden states) for 1. We specify this model design using the argument
`ncat=c(2,1)`

. Note that this model explicitly prohibits transitions from 0** (actually, we're going to call them
`0`

and `0*`

in the fitted model) to 1.

```
hrm<-fitHRM(tree,y,ncat=c(2,1),pi="fitzjohn")
```

```
##
## This is the design matrix of the fitted model.
## Does it make sense?
##
## 0 0* 1
## 0 0 1 2
## 0* 3 0 0
## 1 4 0 0
##
## log-likelihood from current iteration: -85.3678989795076
## --- Best log-likelihood so far: -85.3678989795076 ---
## log-likelihood from current iteration: -85.3678989789946
## --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3679184420598
## --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3678989790264
## --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3678989790468
## --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3678989856971
## --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3678989861835
## --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3678989883086
## --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.3678992525849
## --- Best log-likelihood so far: -85.3678989789946 ---
## log-likelihood from current iteration: -85.367898979139
## --- Best log-likelihood so far: -85.3678989789946 ---
```

```
hrm
```

```
## Object of class "fitHRM".
##
## Observed states: [ 0, 1 ]
## Number of rate categories per state: [ 2, 1 ]
##
## Fitted (or set) value of Q:
## 0 0* 1
## 0 -6.706022 0.580216 6.125806
## 0* 0.000000 0.000000 0.000000
## 1 3.575458 0.000000 -3.575458
##
## Fitted (or set) value of pi:
## 0 0* 1
## 0.707789 0.000000 0.292211
## due to treating the root prior as (a) nuisance.
##
## Log-likelihood: -85.367899
##
## Optimization method used was "nlminb"
```

In addition to the hidden rates model, we can also fit a standard M*k* model. The hidden rates model has this model
as a special case, so we can compare their likelihoods easily.

```
ard<-fitMk(tree,y,model="ARD",pi="fitzjohn")
ard
```

```
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## 0 1
## 0 -0.245458 0.245458
## 1 1.882550 -1.882550
##
## Fitted (or set) value of pi:
## 0 1
## 0.145707 0.854293
## due to treating the root prior as (a) nuisance.
##
## Log-likelihood: -99.131646
##
## Optimization method used was "nlminb"
```

In addition to the verbal hidden-rates model that we've described, let's also fit a *second* hidden rates model in
which instead of two rate categories for 0 and two for 1 we have two rate categories (that is, two hidden states)
for each observed level of our trait. (This is actually the same model that's fit by the `corHMM`

.)

```
hrm2<-fitHRM(tree,y,ncat=2,pi="fitzjohn")
```

```
##
## This is the design matrix of the fitted model.
## Does it make sense?
##
## 0 0* 1 1*
## 0 0 1 2 0
## 0* 3 0 0 4
## 1 5 0 0 6
## 1* 0 7 8 0
##
## log-likelihood from current iteration: -83.0626425938858
## --- Best log-likelihood so far: -83.0626425938858 ---
## log-likelihood from current iteration: -85.3679000610412
## --- Best log-likelihood so far: -83.0626425938858 ---
## log-likelihood from current iteration: -83.0626429300241
## --- Best log-likelihood so far: -83.0626425938858 ---
## log-likelihood from current iteration: -83.0626392569354
## --- Best log-likelihood so far: -83.0626392569354 ---
## log-likelihood from current iteration: -85.5141689438441
## --- Best log-likelihood so far: -83.0626392569354 ---
## log-likelihood from current iteration: -85.4894151270034
## --- Best log-likelihood so far: -83.0626392569354 ---
## log-likelihood from current iteration: -85.4899853283194
## --- Best log-likelihood so far: -83.0626392569354 ---
## log-likelihood from current iteration: -83.0626412601082
## --- Best log-likelihood so far: -83.0626392569354 ---
## log-likelihood from current iteration: -85.5159297400412
## --- Best log-likelihood so far: -83.0626392569354 ---
## log-likelihood from current iteration: -83.062694408704
## --- Best log-likelihood so far: -83.0626392569354 ---
```

```
hrm
```

```
## Object of class "fitHRM".
##
## Observed states: [ 0, 1 ]
## Number of rate categories per state: [ 2, 1 ]
##
## Fitted (or set) value of Q:
## 0 0* 1
## 0 -6.706022 0.580216 6.125806
## 0* 0.000000 0.000000 0.000000
## 1 3.575458 0.000000 -3.575458
##
## Fitted (or set) value of pi:
## 0 0* 1
## 0.707789 0.000000 0.292211
## due to treating the root prior as (a) nuisance.
##
## Log-likelihood: -85.367899
##
## Optimization method used was "nlminb"
```

This model (by default - we can also set it to be `ordered=TRUE`

to prevent this) allows the transitions `0`

$harr;
`1`

and `0*`

↔ `1*`

, as well as `0`

↔ `0*`

and `1`

↔ `1*`

.

Here's a plot of our four different models so you have a better idea.

```
layout(matrix(c(1,2,3,3),2,2,,byrow=TRUE))
plot(ard)
mtext("a) ARD Mk model",line=1,adj=0,font=3)
plot(hrm)
mtext("b) HRM model with 1 hidden state",line=1,
adj=0,font=3)
plot(hrm2)
mtext("c) HRM model with 2 hidden states",line=1,
adj=0,font=3)
```

Let's put all our models into a table and compare them.

```
data.frame(model=c("ARD","HRM-1","HRM-2"),
logL=sapply(list(ard,hrm,hrm2),logLik),
AIC(ard,hrm,hrm2))
```

```
## model logL df AIC
## ard ARD -99.13165 2 202.2633
## hrm HRM-1 -85.36790 4 178.7358
## hrm2 HRM-2 -83.06264 8 182.1253
```

This shows us that the best-supported model is the HRM-1, one hidden-state model. This makes perfect sense because it was the model we used for simulation. Cool!

## No comments:

## Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.