Friday, December 8, 2023

Ancestral state reconstruction with Γ-distributed rate heterogeneity among edges

Yesterday morning I posted on about a variable-rate Mk discrete trait evolution model in which branch-wise rate variation was modeled using a discretized \(\Gamma\)-distribution, much in the same way as we approximate among-site rate variation in nucleotide models for phylogeny inference. I illustrated – by way of a (highly rigorous) single numerical simulation that we could obtain an estimate of the shape parameter of the generating distribution of rate heterogeneity, \(\alpha\). (The \(\Gamma\) distribution is normally a two parameter distribution; however, as with nucleotide models, we’re going to keep the mean of the distribution constant at 1.0, so what we’re really modeling is the relative rate of evolution as it changes from edge to edge in our tree.)

I mentioned in my earlier post the one of the next logical steps might be to add ancestral state estimation – so now I have done that. I theory, I think you’d expect that if the data arose with \(\Gamma\)-distributed rate heterogeneity, then our ancestral states should be more accurate if we account for it in our model.

The only way to find out if that’s true is by trying.

What I’m going to do to test the idea out is simulate data with high rate heterogeneity, \(\alpha\) = 0.1, to give myself the best shot at finding an effect – then reconstruct marginal ancestral states either under the standard Mk model and under our \(\Gamma\) rate heterogeneity model. To measure accuracy, I’m just going to compute the mean probability of the true state. I’m going to generate data with three states, but that arose under an equal rate of transitions between different states.

Here’s what one such simulation could look like:

library(phytools)
packageVersion("phytools")
## [1] '2.0.9'
## simulate the tree
tree<-pbtree(n=200)
## set the Q transition matrix
Q<-matrix(c(-2,1,1,
  1,-2,1,
  1,1,-2),3,3,byrow=TRUE,
  dimnames=list(0:2,0:2))
## set alpha
alpha<-0.1
## randomly sample the rates
rates<-rgamma(n=nrow(tree$edge),
  alpha,alpha)
## create the tree for simulation
simtree<-tree
simtree$edge.length<-simtree$edge.length*rates
## simulate using phytools::sim.Mk
x<-sim.Mk(simtree,Q,internal=TRUE)
## separate out the known ancestors and the tip states
anc<-x[1:tree$Nnode+Ntip(tree)]
x<-x[tree$tip.label]
## fit the standard Mk model
fit_mk<-fitMk(tree,x,model="ER",rand_start=TRUE)
print(fit_mk)
## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##           0         1         2
## 0 -0.651424  0.325712  0.325712
## 1  0.325712 -0.651424  0.325712
## 2  0.325712  0.325712 -0.651424
## 
## Fitted (or set) value of pi:
##        0        1        2 
## 0.333333 0.333333 0.333333 
## due to treating the root prior as (a) flat.
## 
## Log-likelihood: -188.023239 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.
## now fit a Gamma model
fit_gamma<-fitgammaMk(tree,x,model="ER",
  rand_start=TRUE,min.alpha=0.01)
print(fit_gamma)
## Object of class "fitgammaMk".
## 
## Fitted (or set) value of Q:
##           0         1         2
## 0 -1.278864  0.639432  0.639432
## 1  0.639432 -1.278864  0.639432
## 2  0.639432  0.639432 -1.278864
## 
## Fitted (or set) value of alpha rate heterogeneity
## (with 4 rate categories): 0.259345
## 
## Fitted (or set) value of pi:
##        0        1        2 
## 0.333333 0.333333 0.333333 
## due to treating the root prior as (a) flat.
## 
## Log-likelihood: -185.823387 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

Readers can see I updated min.alpha because it defaults to min.alpha=0.1 which is equal to our generating value of \(\alpha\)!

## estimate ancestral states under Mk model
mk_ace<-ancr(fit_mk)
print(mk_ace)
## Marginal ancestral state estimates:
##            0        1        2
## 201 0.354153 0.279073 0.366774
## 202 0.496022 0.195150 0.308828
## 203 0.488793 0.196273 0.314933
## 204 0.230394 0.111118 0.658488
## 205 0.255303 0.076645 0.668051
## 206 0.008752 0.004443 0.986805
## ...
## 
## Log-likelihood = -188.023239
## estimate ancestral states under Gamma model
gamma_ace<-ancr(fit_gamma)
print(gamma_ace)
## Marginal ancestral state estimates:
##            0        1        2
## 201 0.347147 0.407001 0.245852
## 202 0.497903 0.170724 0.331373
## 203 0.485363 0.169787 0.344850
## 204 0.230728 0.070164 0.699109
## 205 0.295011 0.051876 0.653112
## 206 0.017970 0.006795 0.975234
## ...
## 
## Log-likelihood = -185.823387

To measure accuracy, I’m going to first convert my known ancestral states to a binary matrix.

X<-to.matrix(anc,levels(anc))
head(X)
##     0 1 2
## 201 1 0 0
## 202 1 0 0
## 203 1 0 0
## 204 0 0 1
## 205 1 0 0
## 206 0 0 1

This makes it easy to visually compare our two sets of estimates to their known, true values. Remember, the true states are represented as binary YES/NO, while our estimates are probabilities. (This is why we’ll see two columns of points at 0 & 1.)

  par(mfrow=c(1,2))
  plot(to.matrix(anc,levels(anc)),mk_ace$ace,pch=16,
    col=make.transparent("blue",0.5),bty="n",las=1,
    main="Mk model",font.main=1,xlab="generating",
    ylab="estimated")
  lines(c(0,1),c(0,1),lty="dotted")
  legend(x=0.4,y=0.15,c("1:1","best fit"),
    lty=c("dotted","solid"),lwd=c(1,2),bty="n")
  clip(x1=0,x2=1,y1=0,y2=1)
  abline(lm(as.vector(mk_ace$ace)~as.vector(to.matrix(anc,
    levels(anc)))),lwd=2)
  plot(to.matrix(anc,levels(anc)),gamma_ace$ace,pch=16,
    col=make.transparent("blue",0.5),bty="n",las=1,
    main=expression(paste(Gamma," model")),font.main=1,
    xlab="generating",ylab="estimated")
  lines(c(0,1),c(0,1),lty="dotted")
  legend(x=0.4,y=0.15,c("1:1","best fit"),
    lty=c("dotted","solid"),lwd=c(1,2),bty="n")
  clip(x1=0,x2=1,y1=0,y2=1)
  abline(lm(as.vector(gamma_ace$ace)~as.vector(to.matrix(anc,
    levels(anc)))),lwd=2)

plot of chunk unnamed-chunk-101

The closer our solid bar to the dotted 1:1 line, the more faithful our estimates to the true values.

Now, let’s compute a summary measure of the accuracy of our estimates under the constant-rate Mk model, as previously defined. The closer this value is to 1, the more accurate our estimates.

## accuracy of Mk model estimates
1-(1/tree$Nnode)*sum(
  (X-mk_ace$ace)[cbind(1:nrow(X),apply(X,1,
  function(x) which(x==1)))])
## [1] 0.7146536

And finally, under our \(\Gamma\)-distributed variable rate model.

1-(1/tree$Nnode)*sum(
  (X-gamma_ace$ace)[cbind(1:nrow(X),apply(X,1,
  function(x) which(x==1)))])
## [1] 0.7431883

Hopefully, this shows us that our estimates obtained under the variable-rate generating model are more accurate… or maybe it doesn’t: this is just one simulation. To get a better picture of what’s going on, let’s try 200 simulations of the same type!

library(foreach)
## register cluster for parallelized optimization
## iterations
niter<-10
ncores<-min(c(parallel::detectCores()-2,niter))
mc<-parallel::makeCluster(ncores,type="PSOCK")
doParallel::registerDoParallel(cl=mc)
## start simulations
nsim<-200
mk_accuracy<-gamma_accuracy<-vector()
for(i in 1:nsim){
  tree<-pbtree(n=200)
  Q<-matrix(c(-2,1,1,
    1,-2,1,
    1,1,-2),3,3,byrow=TRUE,
    dimnames=list(0:2,0:2))
  alpha<-0.1
  rates<-rgamma(n=nrow(tree$edge),
    alpha,alpha)
  simtree<-tree
  simtree$edge.length<-simtree$edge.length*rates
  x<-sim.Mk(simtree,Q,internal=TRUE)
  anc<-x[1:tree$Nnode+Ntip(tree)]
  x<-x[tree$tip.label]
  fit_mk<-fitMk(tree,x,model="ER",rand_start=TRUE)
  fits<-foreach(i=1:niter)%dopar%{
    phytools::fitgammaMk(tree,x,model="ER",
      rand_start=TRUE,min.alpha=0.01)
  }
  logL<-sapply(fits,logLik)
  fit_gamma<-fits[[which(logL==max(logL))[1]]]
  mk_ace<-ancr(fit_mk)
  gamma_ace<-ancr(fit_gamma)
  X<-to.matrix(anc,levels(anc))
  mk_accuracy[i]<-1-(1/tree$Nnode)*sum(
    (X-mk_ace$ace)[cbind(1:nrow(X),apply(X,1,
    function(x) which(x==1)))])
  gamma_accuracy[i]<-1-(1/tree$Nnode)*sum(
    (X-gamma_ace$ace)[cbind(1:nrow(X),apply(X,1,
    function(x) which(x==1)))])
}
parallel::stopCluster(cl=mc)

Finally, let’s graph the difference in accuracy between the \(\Gamma\) and MK models. If the \(\Gamma\) model has better accuracy, the difference should be positive.

hist(gamma_accuracy-mk_accuracy,las=1,breaks=20,
  main="",xlab=expression(paste("Accuracy of ",Gamma,
    " model compared to Mk")))

plot of chunk unnamed-chunk-106

Of course, we ought to do more simulations – but this indicates a strong trend towards higher accuracy of the generating model (our \(\Gamma\) rate heterogeneity model) compared to the standard Mk. Neat.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.