A friend & colleague asked on Twitter about how stochastic character mapping using
phytools::make.simmap
scales with larger trees.
I've seen a few posts now discussing simmap on medium-large phylo's. Does it scale particularly bad (runtime-wise) with tree size?
— Roi Maor (@Roi_Maor) July 25, 2022
Nearly five years ago, I
posted
about parallelizing make.simmap
using the R package
snow. I thought it might be a good moment to revisit
that post.
For this example, I'll use the same phylogeny & dataset from Ramm et al. (2019) that I used on my blog of yesterday & that also features in an exercise of my upcoming book with Luke Harmon.
Let's load our packages.
library(phytools)
library(geiger)
library(snow)
In this example, I'm going to use the latest phytools version which contains one small fix
to allow an object of class "Qmatrix"
to be passed as a fixed value of Q to make.simmap
.
(Otherwise we need to pass a simple matrix, which can be obtained from an object of class
"Qmatrix"
using object<-unclass(object)
.)
packageVersion("phytools")
## [1] '1.1.3'
Next, we can read our tree and data from file. This time, let's analyze the trait "habitat"
instead of "tail.spines"
.
## now read data from file
lizard.tree<-read.nexus(file="http://www.phytools.org/Rbook/7/lizard_tree.nex")
lizard.data<-read.csv(file="http://www.phytools.org/Rbook/7/lizard_spines.csv",
row.names=1,stringsAsFactors=TRUE)
## run name.check and prune tree to match input data.
chk<-name.check(lizard.tree,lizard.data)
summary(chk)
## 3504 taxa are present in the tree but not the data:
## Ablepharus_chernovi,
## Ablepharus_kitaibelii,
## Ablepharus_pannonicus,
## Abronia_aurita,
## Abronia_campbelli,
## Abronia_chiszari,
## ....
##
## To see complete list of mis-matched taxa, print object.
lizard.tree<-drop.tip(lizard.tree,chk$tree_not_data)
## pull out the environmental trait "habitat"
habitat<-setNames(lizard.data$habitat,
rownames(lizard.data))
For fun, let's plot our data in a fan-style, just as we did yesterday.
plotTree(lizard.tree,type="fan",lwd=1,ftype="off")
obj<-get("last_plot.phylo",envir=.PlotPhyloEnv)
n<-Ntip(lizard.tree)
col.hab<-setNames(c("darkgreen","grey"),levels(habitat))
par(lend=3)
for(i in 1:Ntip(lizard.tree)){
cc<-if(obj$xx[i]>0) 5 else -5
th<-atan(obj$yy[i]/obj$xx[i])
segments(obj$xx[i],obj$yy[i],
obj$xx[i]+cc*cos(th),
obj$yy[i]+cc*sin(th),lwd=4,
col=col.hab[habitat[lizard.tree$tip.label[i]]])
}
legend("topleft",legend=levels(habitat),
pch=15,col=col.hab,
pt.cex=1.5,cex=0.8,bty="n")
So far so good.
Our next step is to obtain the MLE of Q. To do that, I'm going to fit an "ARD"
model to our tree and data vector using phytools::fitMk
as follows. We also think it's safe
to assume that the ancestral lizard in our group was non-saxicolous (non rock-dwelling), rather
than the converse, so we're going to fix pi=c(1,0)
.
fit.habitat<-fitMk(lizard.tree,habitat,model="ARD",
pi=c(1,0))
mle.Q<-as.Qmatrix(fit.habitat)
mle.Q
## Estimated Q matrix:
## non-saxicolous saxicolous
## non-saxicolous -0.002361909 0.002361909
## saxicolous 0.022035900 -0.022035900
Great. Now, let's try to obtain 200 stochastic character maps using make.simmap
but running
on a single core.
system.time(smap_habitat.sc<-make.simmap(lizard.tree,habitat,
Q=mle.Q,pi=c(1,0),nsim=200))
## make.simmap is sampling character histories conditioned on
## the transition matrix
##
## Q =
## non-saxicolous saxicolous
## non-saxicolous -0.002361909 0.002361909
## saxicolous 0.022035900 -0.022035900
## (specified by the user);
## and (mean) root node prior probabilities
## pi =
## non-saxicolous saxicolous
## 1 0
## Done.
## user system elapsed
## 63.98 0.51 64.64
This shows that our 200 stochastic character maps take >60s to run.
summary(smap_habitat.sc)
## 200 trees with a mapped discrete character with states:
## non-saxicolous, saxicolous
##
## trees have 110.05 changes between states on average
##
## changes are of the following types:
## non-saxicolous,saxicolous saxicolous,non-saxicolous
## x->y 31.845 78.205
##
## mean total time spent in each state is:
## non-saxicolous saxicolous total
## raw 1.333930e+04 3560.3273419 16899.63
## prop 7.893251e-01 0.2106749 1.00
Next, let's run our analysis in parallel using makeSOCKcluster
. My computer has four physical
cores and a total of 8 logical processors. It turns out that it doesn't really seem to matter
whether I set my number of cores to 4 or 8. Here, I use 8. (What does matter is that I adjust
my Windows Power Settings for “optimal performance.” If I don't do that, Windows will throttle
CPU use to a lower level.
cl<-makeSOCKcluster(rep("localhost",8))
system.time(smap_habitat.mc<-clusterApply(cl,
x=replicate(8,habitat,simplify=FALSE),
fun=make.simmap,tree=lizard.tree,Q=mle.Q,
pi=c(1,0),nsim=25))
## user system elapsed
## 0.22 0.06 32.14
smap_habitat.mc<-do.call(c,smap_habitat.mc)
if(!("multiSimmap"%in%class(smap_habitat.mc)))
class(smap_habitat.mc)<-c("multiSimmap",
class(smap_habitat.mc))
stopCluster(cl)
summary(smap_habitat.mc)
## 200 trees with a mapped discrete character with states:
## non-saxicolous, saxicolous
##
## trees have 111.435 changes between states on average
##
## changes are of the following types:
## non-saxicolous,saxicolous saxicolous,non-saxicolous
## x->y 31.46 79.975
##
## mean total time spent in each state is:
## non-saxicolous saxicolous total
## raw 1.329105e+04 3608.5802300 16899.63
## prop 7.864698e-01 0.2135302 1.00
Neat, we can already more or less see that we get the same results whether we run on a single core, or on all 8 processors, except the latter is over twice as fast!
Finally, let's graph our results.
habitat.dMap<-densityMap(smap_habitat.mc,plot=FALSE)
## sorry - this might take a while; please be patient
habitat.dMap<-setMap(habitat.dMap,col.hab)
pp<-summary(smap_habitat.mc)$ace[1:lizard.tree$Nnode,]
plot(habitat.dMap,type="fan",ftype="off",lwd=2,legend=FALSE,
ylim=max(nodeHeights(lizard.tree))*c(-1,1.1))
obj<-get("last_plot.phylo",envir=.PlotPhyloEnv)
n<-Ntip(lizard.tree)
par(lend=3)
for(i in 1:Ntip(lizard.tree)){
cc<-if(obj$xx[i]>0) 5 else -5
th<-atan(obj$yy[i]/obj$xx[i])
segments(obj$xx[i],obj$yy[i],
obj$xx[i]+cc*cos(th),
obj$yy[i]+cc*sin(th),lwd=4,
col=col.hab[habitat[lizard.tree$tip.label[i]]])
}
add.color.bar(max(nodeHeights(lizard.tree)),habitat.dMap$cols,
title="PP(state = \"saxicolous\")",
prompt=FALSE,lwd=6,outline=FALSE,x=0.95*par()$usr[1],
y=0.9*par()$usr[4],subtitle="",fsize=0.8)
h<-max(nodeHeights(lizard.tree))
ii<-order(rowSums(pp^2),decreasing=TRUE)
for(i in ii){
plotrix::floating.pie(obj$xx[i+Ntip(lizard.tree)],
obj$yy[i+Ntip(lizard.tree)],
radius=if(max(pp[i,])<=0.95) 0.03*h else 0.01*h,
x=pp[i,],col=col.hab,
border=if(max(pp[i,])<=0.95) "black" else "transparent")
}
legend(x=0.98*par()$usr[1],y=0.9*par()$usr[4],
levels(habitat),pch=21,pt.bg=col.hab,
pt.cex=2,bty="n",cex=0.8)
Looks like it worked!
No comments:
Post a Comment
Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.