Wednesday, April 3, 2024

Another post about running multiple optimization iterations of fitMk in parallel using the foreach package

This is a “quick & dirty” tutorial on how to run multiple optimizations of the Mk discrete character evolution model with fitMk in parallel using the foreach R package. I know I’ve posted similar things previously on this blog, but I thought this could be handy as a reference.

We should start by loading the packages we need. If a recent version of phytools is installed, then you should also have doParallel and foreach.

library(phytools)
library(doParallel)
library(foreach)

For this demo, I’ll use a phylogeny of Anolis lizard and the discrete character of ecomorphological category (“ecomorph”).

These data are available on our book website. Here, I’ll just read in the phylogeny & data directly as follows.

anolis_tree<-read.tree(
  file="http://www.phytools.org/Rbook/1/Anolis.tre")
anolis_tree
## 
## Phylogenetic tree with 100 tips and 99 internal nodes.
## 
## Tip labels:
##   ahli, allogus, rubribarbus, imias, sagrei, bremeri, ...
## 
## Rooted; includes branch lengths.
ecomorph_data<-read.csv(
  file="http://www.phytools.org/Rbook/1/ecomorph.csv",
  row.names=1,stringsAsFactors=TRUE)
head(ecomorph_data,20)

In the data, "TG" means “trunk-ground”, "GB" grass-bush, "Tw" twig, and so on. These each indicate the microhabitat specialization of each taxon.

There are some mismatches between the tree & data, so we can reconcile the two using geiger::name.check.

chk<-geiger::name.check(anolis_tree,
  ecomorph_data)
summary(chk)
## 18 taxa are present in the tree but not the data:
##     argenteolus,
##     argillaceus,
##     barbatus,
##     barbouri,
##     bartschi,
##     centralis,
##     ....
## 
## To see complete list of mis-matched taxa, print object.
anolis_tree<-drop.tip(anolis_tree,
  chk$tree_not_data)

Now, just for fun, let’s graph our tree & data using my new function plotFanTree.wTraits. (This step requires readers to update phytools from GitHub.)

cols<-plotFanTree.wTraits(anolis_tree,ecomorph_data,
  part=0.5)
legend("topleft",names(cols[[1]]),pch=15,col=cols[[1]],
  pt.cex=1.5,cex=0.8,bty="n",inset=0.05)

plot of chunk unnamed-chunk-11 Great. So far so good.

Since ecomorph_data is still a data frame, and fitMk takes a factor (or character, or numeric) vector as input, let’s pull out our character of interest as follows.

ecomorph<-setNames(ecomorph_data$ecomorph,
  rownames(ecomorph_data))

Finally, to our parallelization code.

I don’t need to do this, but for the example I’m using randomly-selected optimizers (optim or nlminb) and randomly optimizing on the log-scale or the original scale. I must allow the optimizer to pick random starting values (rand_start=TRUE) for each optimization iteration – otherwise our optimization would be fully deterministic each time!

Go.

niter<-20 ## set iterations
## randomly sample optimization methods
opt.method<-sample(c("nlminb","optim"),niter,
  replace=TRUE)
## randomly sample log/linear scale
logscale<-sample(c(TRUE,FALSE),niter,replace=TRUE)
## set ncores and open cluster
ncores<-min(niter,detectCores()-1)
mc<-makeCluster(ncores,type="PSOCK")
registerDoParallel(cl=mc)
## run niter optimizations in parallel
ard_fits<-foreach(i=1:niter)%dopar%{
  phytools::fitMk(anolis_tree,ecomorph,model="ARD",
    pi="fitzjohn",rand_start=TRUE,
    opt.method=opt.method[i],logscale=logscale[i])
}
stopCluster(mc) ## stop cluster

The number of optimization iterations (niter = 20) in this example is arbitrary, FWIW. Obviously, the more we run, the more confident we can be that the correct solution has been found!

To start, let’s look at all the log-likelihoods from our set of niter fitted models.

logL<-sapply(ard_fits,logLik)
logL
##  [1] -65.02146 -68.22627 -68.22627 -68.01720 -65.02146 -68.01720 -65.02146 -66.13570 -65.61117
## [10] -65.61115 -68.22627 -68.01716 -68.01722 -68.22628 -68.01716 -65.61116 -68.01718 -68.01716
## [19] -65.61114 -68.01716

This should show us that we have not converged to the same solution for each of our niter optimizations!

Next, let’s graph just the top twelve fitted models with their log-likelihoods – sorted from best to worst.

ii<-order(logL,decreasing=TRUE)[1:12]
top_twelve<-ard_fits[ii]
par(mfrow=c(3,4))
for(i in 1:min(niter,12)){
  plot(top_twelve[[i]],width=TRUE,mar=rep(0.1,4))
  mtext(paste("log(L) =",
    signif(logLik(top_twelve[[i]]),6)),adj=0.05,
    line=-1.2)
}

plot of chunk unnamed-chunk-15

Interestingly (but not all that surprisingly) these different fitted models show substantial parameter instability between models with highly similar log-likelihoods! This is very likely a byproduct of the very high number of parameters to be estimated in the "ARD" model here (thirty!), relative to the amount of data we have about the evolutionary process from an 82-taxon tree, and should definitely undermine our confidence in the specific “best” model we encountered.

Nonetheless, here it is:

best_ard<-ard_fits[[which.max(logL)]]
plot(best_ard,width=TRUE,color=TRUE,
  xlim=c(-2,1),ylim=c(-1,1))

plot of chunk unnamed-chunk-16

That’s it, folks.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.