Tuesday, July 26, 2022

Running make.simmap in parallel on large trees using the snow package

A friend & colleague asked on Twitter about how stochastic character mapping using phytools::make.simmap scales with larger trees.

Nearly five years ago, I posted about parallelizing make.simmap using the R package snow. I thought it might be a good moment to revisit that post.

For this example, I'll use the same phylogeny & dataset from Ramm et al. (2019) that I used on my blog of yesterday & that also features in an exercise of my upcoming book with Luke Harmon.

Let's load our packages.

library(phytools)
library(geiger)
library(snow)

In this example, I'm going to use the latest phytools version which contains one small fix to allow an object of class "Qmatrix" to be passed as a fixed value of Q to make.simmap. (Otherwise we need to pass a simple matrix, which can be obtained from an object of class "Qmatrix" using object<-unclass(object).)

packageVersion("phytools")
## [1] '1.1.3'

Next, we can read our tree and data from file. This time, let's analyze the trait "habitat" instead of "tail.spines".

## now read data from file
lizard.tree<-read.nexus(file="http://www.phytools.org/Rbook/7/lizard_tree.nex")
lizard.data<-read.csv(file="http://www.phytools.org/Rbook/7/lizard_spines.csv",
    row.names=1,stringsAsFactors=TRUE)
## run name.check and prune tree to match input data.
chk<-name.check(lizard.tree,lizard.data)
summary(chk)
## 3504 taxa are present in the tree but not the data:
##     Ablepharus_chernovi,
##     Ablepharus_kitaibelii,
##     Ablepharus_pannonicus,
##     Abronia_aurita,
##     Abronia_campbelli,
##     Abronia_chiszari,
##     ....
## 
## To see complete list of mis-matched taxa, print object.
lizard.tree<-drop.tip(lizard.tree,chk$tree_not_data)
## pull out the environmental trait "habitat"
habitat<-setNames(lizard.data$habitat,
    rownames(lizard.data))

For fun, let's plot our data in a fan-style, just as we did yesterday.

plotTree(lizard.tree,type="fan",lwd=1,ftype="off")
obj<-get("last_plot.phylo",envir=.PlotPhyloEnv)
n<-Ntip(lizard.tree)
col.hab<-setNames(c("darkgreen","grey"),levels(habitat))
par(lend=3)
for(i in 1:Ntip(lizard.tree)){
    cc<-if(obj$xx[i]>0) 5 else -5
    th<-atan(obj$yy[i]/obj$xx[i])
    segments(obj$xx[i],obj$yy[i],
        obj$xx[i]+cc*cos(th),
        obj$yy[i]+cc*sin(th),lwd=4,
        col=col.hab[habitat[lizard.tree$tip.label[i]]])
}
legend("topleft",legend=levels(habitat),
    pch=15,col=col.hab,
    pt.cex=1.5,cex=0.8,bty="n")

plot of chunk unnamed-chunk-4

So far so good.

Our next step is to obtain the MLE of Q. To do that, I'm going to fit an "ARD" model to our tree and data vector using phytools::fitMk as follows. We also think it's safe to assume that the ancestral lizard in our group was non-saxicolous (non rock-dwelling), rather than the converse, so we're going to fix pi=c(1,0).

fit.habitat<-fitMk(lizard.tree,habitat,model="ARD",
    pi=c(1,0))
mle.Q<-as.Qmatrix(fit.habitat)
mle.Q
## Estimated Q matrix:
##                non-saxicolous   saxicolous
## non-saxicolous   -0.002361909  0.002361909
## saxicolous        0.022035900 -0.022035900

Great. Now, let's try to obtain 200 stochastic character maps using make.simmap but running on a single core.

system.time(smap_habitat.sc<-make.simmap(lizard.tree,habitat,
    Q=mle.Q,pi=c(1,0),nsim=200))
## make.simmap is sampling character histories conditioned on
## the transition matrix
## 
## Q =
##                non-saxicolous   saxicolous
## non-saxicolous   -0.002361909  0.002361909
## saxicolous        0.022035900 -0.022035900
## (specified by the user);
## and (mean) root node prior probabilities
## pi =
## non-saxicolous     saxicolous 
##              1              0
## Done.
##    user  system elapsed 
##   63.98    0.51   64.64

This shows that our 200 stochastic character maps take >60s to run.

summary(smap_habitat.sc)
## 200 trees with a mapped discrete character with states:
##  non-saxicolous, saxicolous 
## 
## trees have 110.05 changes between states on average
## 
## changes are of the following types:
##      non-saxicolous,saxicolous saxicolous,non-saxicolous
## x->y                    31.845                    78.205
## 
## mean total time spent in each state is:
##      non-saxicolous   saxicolous    total
## raw    1.333930e+04 3560.3273419 16899.63
## prop   7.893251e-01    0.2106749     1.00

Next, let's run our analysis in parallel using makeSOCKcluster. My computer has four physical cores and a total of 8 logical processors. It turns out that it doesn't really seem to matter whether I set my number of cores to 4 or 8. Here, I use 8. (What does matter is that I adjust my Windows Power Settings for “optimal performance.” If I don't do that, Windows will throttle CPU use to a lower level.

cl<-makeSOCKcluster(rep("localhost",8))
system.time(smap_habitat.mc<-clusterApply(cl,
    x=replicate(8,habitat,simplify=FALSE),
    fun=make.simmap,tree=lizard.tree,Q=mle.Q,
    pi=c(1,0),nsim=25))
##    user  system elapsed 
##    0.22    0.06   32.14
smap_habitat.mc<-do.call(c,smap_habitat.mc)
if(!("multiSimmap"%in%class(smap_habitat.mc))) 
    class(smap_habitat.mc)<-c("multiSimmap",
        class(smap_habitat.mc))
stopCluster(cl)
summary(smap_habitat.mc)
## 200 trees with a mapped discrete character with states:
##  non-saxicolous, saxicolous 
## 
## trees have 111.435 changes between states on average
## 
## changes are of the following types:
##      non-saxicolous,saxicolous saxicolous,non-saxicolous
## x->y                     31.46                    79.975
## 
## mean total time spent in each state is:
##      non-saxicolous   saxicolous    total
## raw    1.329105e+04 3608.5802300 16899.63
## prop   7.864698e-01    0.2135302     1.00

Neat, we can already more or less see that we get the same results whether we run on a single core, or on all 8 processors, except the latter is over twice as fast!

Finally, let's graph our results.

habitat.dMap<-densityMap(smap_habitat.mc,plot=FALSE)
## sorry - this might take a while; please be patient
habitat.dMap<-setMap(habitat.dMap,col.hab)
pp<-summary(smap_habitat.mc)$ace[1:lizard.tree$Nnode,]
plot(habitat.dMap,type="fan",ftype="off",lwd=2,legend=FALSE,
    ylim=max(nodeHeights(lizard.tree))*c(-1,1.1))
obj<-get("last_plot.phylo",envir=.PlotPhyloEnv)
n<-Ntip(lizard.tree)
par(lend=3)
for(i in 1:Ntip(lizard.tree)){
    cc<-if(obj$xx[i]>0) 5 else -5
    th<-atan(obj$yy[i]/obj$xx[i])
    segments(obj$xx[i],obj$yy[i],
        obj$xx[i]+cc*cos(th),
        obj$yy[i]+cc*sin(th),lwd=4,
        col=col.hab[habitat[lizard.tree$tip.label[i]]])
}
add.color.bar(max(nodeHeights(lizard.tree)),habitat.dMap$cols,
    title="PP(state = \"saxicolous\")",
    prompt=FALSE,lwd=6,outline=FALSE,x=0.95*par()$usr[1],
    y=0.9*par()$usr[4],subtitle="",fsize=0.8)
h<-max(nodeHeights(lizard.tree))
ii<-order(rowSums(pp^2),decreasing=TRUE)
for(i in ii){
    plotrix::floating.pie(obj$xx[i+Ntip(lizard.tree)],
        obj$yy[i+Ntip(lizard.tree)],
        radius=if(max(pp[i,])<=0.95) 0.03*h else 0.01*h,
        x=pp[i,],col=col.hab,
        border=if(max(pp[i,])<=0.95) "black" else "transparent")
}
legend(x=0.98*par()$usr[1],y=0.9*par()$usr[4],
    levels(habitat),pch=21,pt.bg=col.hab,
    pt.cex=2,bty="n",cex=0.8)

plot of chunk unnamed-chunk-9

Looks like it worked!

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.