Yesterday morning I posted on about a variable-rate M*k* discrete trait evolution model in which branch-wise rate variation was modeled using a discretized \(\Gamma\)-distribution, much in the same way as we approximate among-site rate variation in nucleotide models for phylogeny inference. I illustrated – by way of a (highly rigorous) *single* numerical simulation that we could obtain an estimate of the shape parameter of the generating distribution of rate heterogeneity, \(\alpha\). (The \(\Gamma\) distribution is normally a two parameter distribution; however, as with nucleotide models, we’re going to keep the mean of the distribution constant at 1.0, so what we’re really modeling is the relative rate of evolution as it changes from edge to edge in our tree.)

I mentioned in my earlier post the one of the next logical steps might be to add ancestral state estimation – so now I have done that. I theory, I think you’d expect that if the data arose with \(\Gamma\)-distributed rate heterogeneity, then our ancestral states should be more *accurate* if we account for it in our model.

The only way to find out if that’s true is by trying.

What I’m going to do to test the idea out is simulate data with *high* rate heterogeneity, \(\alpha\) = 0.1, to give myself the best shot at finding an effect – then reconstruct marginal ancestral states either under the standard M*k* model and under our \(\Gamma\) rate heterogeneity model. To measure accuracy, I’m just going to compute the mean probability of the true state. I’m going to generate data with three states, but that arose under an equal rate of transitions between different states.

Here’s what one such simulation could look like:

```
library(phytools)
packageVersion("phytools")
```

```
## [1] '2.0.9'
```

```
## simulate the tree
tree<-pbtree(n=200)
## set the Q transition matrix
Q<-matrix(c(-2,1,1,
1,-2,1,
1,1,-2),3,3,byrow=TRUE,
dimnames=list(0:2,0:2))
## set alpha
alpha<-0.1
## randomly sample the rates
rates<-rgamma(n=nrow(tree$edge),
alpha,alpha)
## create the tree for simulation
simtree<-tree
simtree$edge.length<-simtree$edge.length*rates
## simulate using phytools::sim.Mk
x<-sim.Mk(simtree,Q,internal=TRUE)
## separate out the known ancestors and the tip states
anc<-x[1:tree$Nnode+Ntip(tree)]
x<-x[tree$tip.label]
## fit the standard Mk model
fit_mk<-fitMk(tree,x,model="ER",rand_start=TRUE)
print(fit_mk)
```

```
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## 0 1 2
## 0 -0.651424 0.325712 0.325712
## 1 0.325712 -0.651424 0.325712
## 2 0.325712 0.325712 -0.651424
##
## Fitted (or set) value of pi:
## 0 1 2
## 0.333333 0.333333 0.333333
## due to treating the root prior as (a) flat.
##
## Log-likelihood: -188.023239
##
## Optimization method used was "nlminb"
##
## R thinks it has found the ML solution.
```

```
## now fit a Gamma model
fit_gamma<-fitgammaMk(tree,x,model="ER",
rand_start=TRUE,min.alpha=0.01)
print(fit_gamma)
```

```
## Object of class "fitgammaMk".
##
## Fitted (or set) value of Q:
## 0 1 2
## 0 -1.278864 0.639432 0.639432
## 1 0.639432 -1.278864 0.639432
## 2 0.639432 0.639432 -1.278864
##
## Fitted (or set) value of alpha rate heterogeneity
## (with 4 rate categories): 0.259345
##
## Fitted (or set) value of pi:
## 0 1 2
## 0.333333 0.333333 0.333333
## due to treating the root prior as (a) flat.
##
## Log-likelihood: -185.823387
##
## Optimization method used was "nlminb"
##
## R thinks it has found the ML solution.
```

Readers can see I updated `min.alpha`

because it defaults to `min.alpha=0.1`

which is equal to our generating value of \(\alpha\)!

```
## estimate ancestral states under Mk model
mk_ace<-ancr(fit_mk)
print(mk_ace)
```

```
## Marginal ancestral state estimates:
## 0 1 2
## 201 0.354153 0.279073 0.366774
## 202 0.496022 0.195150 0.308828
## 203 0.488793 0.196273 0.314933
## 204 0.230394 0.111118 0.658488
## 205 0.255303 0.076645 0.668051
## 206 0.008752 0.004443 0.986805
## ...
##
## Log-likelihood = -188.023239
```

```
## estimate ancestral states under Gamma model
gamma_ace<-ancr(fit_gamma)
print(gamma_ace)
```

```
## Marginal ancestral state estimates:
## 0 1 2
## 201 0.347147 0.407001 0.245852
## 202 0.497903 0.170724 0.331373
## 203 0.485363 0.169787 0.344850
## 204 0.230728 0.070164 0.699109
## 205 0.295011 0.051876 0.653112
## 206 0.017970 0.006795 0.975234
## ...
##
## Log-likelihood = -185.823387
```

To measure accuracy, I’m going to first convert my *known* ancestral states to a binary matrix.

```
X<-to.matrix(anc,levels(anc))
head(X)
```

```
## 0 1 2
## 201 1 0 0
## 202 1 0 0
## 203 1 0 0
## 204 0 0 1
## 205 1 0 0
## 206 0 0 1
```

This makes it easy to visually compare our two sets of estimates to their known, true values. Remember, the true states are represented as *binary* YES/NO, while our estimates are probabilities. (This is why we’ll see two columns of points at 0 & 1.)

```
par(mfrow=c(1,2))
plot(to.matrix(anc,levels(anc)),mk_ace$ace,pch=16,
col=make.transparent("blue",0.5),bty="n",las=1,
main="Mk model",font.main=1,xlab="generating",
ylab="estimated")
lines(c(0,1),c(0,1),lty="dotted")
legend(x=0.4,y=0.15,c("1:1","best fit"),
lty=c("dotted","solid"),lwd=c(1,2),bty="n")
clip(x1=0,x2=1,y1=0,y2=1)
abline(lm(as.vector(mk_ace$ace)~as.vector(to.matrix(anc,
levels(anc)))),lwd=2)
plot(to.matrix(anc,levels(anc)),gamma_ace$ace,pch=16,
col=make.transparent("blue",0.5),bty="n",las=1,
main=expression(paste(Gamma," model")),font.main=1,
xlab="generating",ylab="estimated")
lines(c(0,1),c(0,1),lty="dotted")
legend(x=0.4,y=0.15,c("1:1","best fit"),
lty=c("dotted","solid"),lwd=c(1,2),bty="n")
clip(x1=0,x2=1,y1=0,y2=1)
abline(lm(as.vector(gamma_ace$ace)~as.vector(to.matrix(anc,
levels(anc)))),lwd=2)
```

The closer our solid bar to the dotted 1:1 line, the more faithful our estimates to the true values.

Now, let’s compute a summary measure of the accuracy of our estimates under the constant-rate M*k* model, as previously defined. The closer this value is to 1, the more accurate our estimates.

```
## accuracy of Mk model estimates
1-(1/tree$Nnode)*sum(
(X-mk_ace$ace)[cbind(1:nrow(X),apply(X,1,
function(x) which(x==1)))])
```

```
## [1] 0.7146536
```

And finally, under our \(\Gamma\)-distributed variable rate model.

```
1-(1/tree$Nnode)*sum(
(X-gamma_ace$ace)[cbind(1:nrow(X),apply(X,1,
function(x) which(x==1)))])
```

```
## [1] 0.7431883
```

Hopefully, this shows us that our estimates obtained under the variable-rate generating model are more accurate… or maybe it doesn’t: this is just one simulation. To get a better picture of what’s going on, let’s try 200 simulations of the same type!

```
library(foreach)
## register cluster for parallelized optimization
## iterations
niter<-10
ncores<-min(c(parallel::detectCores()-2,niter))
mc<-parallel::makeCluster(ncores,type="PSOCK")
doParallel::registerDoParallel(cl=mc)
## start simulations
nsim<-200
mk_accuracy<-gamma_accuracy<-vector()
for(i in 1:nsim){
tree<-pbtree(n=200)
Q<-matrix(c(-2,1,1,
1,-2,1,
1,1,-2),3,3,byrow=TRUE,
dimnames=list(0:2,0:2))
alpha<-0.1
rates<-rgamma(n=nrow(tree$edge),
alpha,alpha)
simtree<-tree
simtree$edge.length<-simtree$edge.length*rates
x<-sim.Mk(simtree,Q,internal=TRUE)
anc<-x[1:tree$Nnode+Ntip(tree)]
x<-x[tree$tip.label]
fit_mk<-fitMk(tree,x,model="ER",rand_start=TRUE)
fits<-foreach(i=1:niter)%dopar%{
phytools::fitgammaMk(tree,x,model="ER",
rand_start=TRUE,min.alpha=0.01)
}
logL<-sapply(fits,logLik)
fit_gamma<-fits[[which(logL==max(logL))[1]]]
mk_ace<-ancr(fit_mk)
gamma_ace<-ancr(fit_gamma)
X<-to.matrix(anc,levels(anc))
mk_accuracy[i]<-1-(1/tree$Nnode)*sum(
(X-mk_ace$ace)[cbind(1:nrow(X),apply(X,1,
function(x) which(x==1)))])
gamma_accuracy[i]<-1-(1/tree$Nnode)*sum(
(X-gamma_ace$ace)[cbind(1:nrow(X),apply(X,1,
function(x) which(x==1)))])
}
parallel::stopCluster(cl=mc)
```

Finally, let’s graph the *difference* in accuracy between the \(\Gamma\) and M*K* models. If the \(\Gamma\) model has better accuracy, the difference should be *positive*.

```
hist(gamma_accuracy-mk_accuracy,las=1,breaks=20,
main="",xlab=expression(paste("Accuracy of ",Gamma,
" model compared to Mk")))
```

Of course, we ought to do more simulations – but this indicates a strong trend towards higher accuracy of the *generating* model (our \(\Gamma\) rate heterogeneity model) compared to the standard M*k*. Neat.

## No comments:

## Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.