I recently fielded an interesting question from Oscar Inostroza of the Universidad de Concepción about how to choose among alternative models of the substitution process for stochastic mapping when different models have comparable support. Presently, it is possible to condition on a particular model (such as, for example, model="ER", the “equal rates” model), and then either fix the transition matrix Q at its empirical ML value, or sample Q from its posterior distribution conditioned on the selected model. It is not possible, however, to sample character histories conditioned on multiple alternative models of character change between states.
Ideally, what we might like to do in this case is design a reversible-jump MCMC to sample among models of character transition in proportion to their posterior probabilities; unfortunately, however, this is not practical at the current time. What I suggested instead is that we use Akaike weights to generate stochastic maps under all of the alternative models under consideration, in proportion (or approximately in proportion, since we must use a finite number of simulations) to the weight of evidence in support of each model.
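As a reminder, the Akaike weight of model i is just exp(-Δi/2)/Σj exp(-Δj/2), in which Δi is the difference between the AIC of model i & the minimum AIC across all the models in our candidate set.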
There are a few tricks to doing this. In the following, I'll illustrate how it can be done using phytools.
First, some preliminaries. Let's start by loading phytools & simulating some data for the present case. I will simulate data under a "SYM" (symmetric) transition model, but in which the rates are not too dissimilar between states - which I hope will create fairly even support for the "SYM" & "ER" models on a modest-sized phylogeny.
## load packages
library(phytools)
## simulate some data
tree<-pbtree(n=40,scale=1)
Q<-matrix(c(-1,0.75,0.25,
0.75,-1.25,0.5,
0.25,0.5,-0.75),3,3)
rownames(Q)<-colnames(Q)<-letters[1:3]
Q
## a b c
## a -1.00 0.75 0.25
## b 0.75 -1.25 0.50
## c 0.25 0.50 -0.75
x<-sim.history(tree,Q)$states
## Done simulation(s).
x
## t24 t33 t34 t10 t9 t25 t26 t21 t22 t1 t39 t40 t20 t28 t29 t31 t32 t19
## "c" "c" "c" "c" "c" "a" "a" "b" "c" "b" "b" "b" "b" "c" "b" "b" "b" "b"
## t13 t23 t30 t35 t36 t27 t14 t15 t3 t6 t7 t2 t17 t18 t8 t4 t5 t37
## "b" "a" "a" "a" "a" "a" "a" "a" "a" "c" "a" "a" "a" "a" "a" "a" "a" "c"
## t38 t16 t11 t12
## "c" "b" "a" "a"
Next, here are some simple functions that we'll use later to compute the AIC scores and the Akaike weights for each model:
## AIC from a log-likelihood & number of estimated parameters
aic<-function(logL,k) 2*k-2*logL
## Akaike weights from a vector of AIC scores
aic.w<-function(aic){
	d.aic<-aic-min(aic)
	exp(-1/2*d.aic)/sum(exp(-1/2*d.aic))
}
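To see how these behave, here is a toy example in which AIC scores separated by two units each yield weights of roughly 0.67, 0.24, & 0.09 (the input values are made up purely for illustration):
aic.w(c(10,12,14))
## [1] 0.6652410 0.2447285 0.0900306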
Now, we need to fit & (more importantly) obtain the log-likelihood of each fitted model. We could actually do this with multiple functions in R (ace, fitDiscrete, & optim.pml, among others); however, here I will use make.simmap, which actually computes the likelihood using code modified from ace.
## compute log likelihoods
logL<-sapply(c("ER","SYM","ARD"),
function(model,tree,x) make.simmap(tree,x,model)$logL,
tree=tree,x=x)
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
## a b c
## a -0.7389255 0.3694628 0.3694628
## b 0.3694628 -0.7389255 0.3694628
## c 0.3694628 0.3694628 -0.7389255
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
## a b c
## a -0.5566061 0.000000 0.5566061
## b 0.0000000 -1.158184 1.1581840
## c 0.5566061 1.158184 -1.7147901
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
## a b c
## a -0.5197351 0.03634396 0.4833911
## b 0.0000000 -0.65052948 0.6505295
## c 1.4205225 0.95975112 -2.3802736
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
logL
## ER SYM ARD
## -28.55567 -26.84571 -26.38334
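(Incidentally, if we'd rather not see make.simmap print its progress messages each time, I believe we can pass it the optional argument message=FALSE - for instance:
logL<-sapply(c("ER","SYM","ARD"),
	function(model,tree,x) make.simmap(tree,x,model,message=FALSE)$logL,
	tree=tree,x=x)
should give the same log-likelihoods without the printout.)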
## now let's compute AIC values & weights
## (in a real study, we might use AICc)
AIC<-mapply(aic,logL,c(1,3,6))
AIC
## ER SYM ARD
## 59.11134 59.69141 64.76668
AIC.W<-aic.w(AIC)
AIC.W
## ER SYM ARD
## 0.55328515 0.41398768 0.03272717
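Since the comment above notes that in a real study we might prefer the small-sample corrected AICc, here is a sketch of how that could look, using the standard correction & (as one common, though debatable, choice) the number of tips as the sample size; aicc here is just a hypothetical helper built on our aic function from before:
aicc<-function(logL,k,n) aic(logL,k)+(2*k*(k+1))/(n-k-1)
AICc<-mapply(aicc,logL,c(1,3,6),n=Ntip(tree))
We could then compute our weights from AICc rather than AIC exactly as before.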
In our next step, we can “normalize” our weights to the number of simulations we want to use for stochastic mapping. The trick here is to remember that if we simply round the product of the weights (which are normalized to sum to one) × our desired total number of simulations, we might end up with a total number of simulations that is less than or greater than that target. In the code below I (arbitrarily) add the deficit to (or subtract the surplus from) models chosen at random; but this need not be our strategy.
## now let's normalize this to our number of stochastic
## mapping simulations
nsim<-1000
Nsim<-round(nsim*AIC.W)
d<-if(sum(Nsim)>nsim) -1 else 1
nsim<-Nsim+d*sample(c(rep(1,abs(nsim-sum(Nsim))),
rep(0,length(Nsim)-abs(nsim-sum(Nsim)))))
nsim
## ER SYM ARD
## 553 414 33
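We can check that the adjusted numbers of simulations still sum to our desired total:
sum(nsim)
## [1] 1000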
Finally, I will perform nsim stochastic mapping simulations for each model & then combine the stochastic maps conducted across all of our models into one object of class "multiPhylo":
## remove any with nsim==0
nsim<-nsim[nsim!=0]
trees<-list()
class(trees)<-"multiPhylo"
for(i in 1:length(nsim)){
	obj<-make.simmap(tree,x,model=names(nsim)[i],nsim=nsim[i])
	## we could also use:
	# obj<-make.simmap(tree,x,model=names(nsim)[i],nsim=nsim[i],Q="mcmc")
	## but it would be slower
	## make.simmap returns a single tree (not a "multiPhylo" object)
	## when nsim==1, so in that case we wrap the result in a list
	if(nsim[i]==1){
		obj<-list(obj)
		class(obj)<-"multiPhylo"
	}
	trees<-c(trees,obj)
}
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
## a b c
## a -0.7389255 0.3694628 0.3694628
## b 0.3694628 -0.7389255 0.3694628
## c 0.3694628 0.3694628 -0.7389255
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
## a b c
## a -0.5566061 0.000000 0.5566061
## b 0.0000000 -1.158184 1.1581840
## c 0.5566061 1.158184 -1.7147901
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
## a b c
## a -0.5197351 0.03634396 0.4833911
## b 0.0000000 -0.65052948 0.6505295
## c 1.4205225 0.95975112 -2.3802736
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## a b c
## 0.3333333 0.3333333 0.3333333
## Done.
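Just to confirm that we ended up with the desired total number of mapped trees:
length(trees)
## [1] 1000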
With this object, it is possible to do any of the standard sorts of things, such as use describe.simmap to summarize the results of our analysis, including computing the posterior probabilities of each state at internal nodes, etc. For instance:
obj<-describe.simmap(trees)
obj
## 1000 trees with a mapped discrete character with states:
## a, b, c
##
## trees have 11.26 changes between states on average
##
## changes are of the following types:
## a,b a,c b,a b,c c,a c,b
## x->y 1.238 3.593 0.254 1.816 1.667 2.692
##
## mean total time spent in each state is:
## a b c total
## raw 8.5813823 2.6669467 2.6141229 13.86245
## prop 0.6190378 0.1923864 0.1885758 1.00000
plot(obj)
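Here I have just used the default plot settings; if I remember correctly, the plot method for this object class will also accept optional arguments such as colors (a vector of colors named by character states) & fsize (for font size) - so something like the following should also work:
cols<-setNames(c("blue","red","green"),letters[1:3])
plot(obj,colors=cols,fsize=0.8)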
and so on.
That's all for now on this topic.