## Saturday, July 25, 2015

### Integrating stochastic character maps across multiple character transition models

I recently fielded an interesting question by Oscar Inostroza from the Universidad de Concepción how to choose among alternative models for the substitution process for stochastic mapping when different models have comparable support. Presently, it is possible to condition on a particular model (such as, for example, `model="ER"`, the “equal rates” model), then either fix the transition matrix `Q` at it's empirical ML value; or sample `Q` from its posterior distribution conditioned on the selected model. It is not possible, however, to sample character histories conditioned on multiple alternative models for character change between states.

Ideally what we might like to do in this case is design a reversible-jump MCMC to sample among models for character transition in proportion to their posterior probabilities; however unfortunately this is not practical at the current time. What I suggested instead is that we use Akaike weights to generate stochastic maps under all of the alternative models under consideration, in proportion to (or, approximately in proportion to - since we must use a finite number of simulations) the weight of evidence in support of that model.

There are a few tricks to doing this. In the following, I'll illustrate how it can be done using phytools.

First, some preliminaries. Let's start by loading phytools & simulating some data for the present case. I will simulate data under a `"SYM"` (symmetric) transition model, but in which the rates are not too dissimilar between states - which I hope will create fairly even support for the `"SYM"` & `"ER"` models on a modest sized phylogeny.

``````## load packages
library(phytools)

## simulate some data
tree<-pbtree(n=40,scale=1)
Q<-matrix(c(-1,0.75,0.25,
0.75,-1.25,0.5,
0.25,0.5,-0.75),3,3)
rownames(Q)<-colnames(Q)<-letters[1:3]
Q
``````
``````##       a     b     c
## a -1.00  0.75  0.25
## b  0.75 -1.25  0.50
## c  0.25  0.50 -0.75
``````
``````x<-sim.history(tree,Q)\$states
``````
``````## Done simulation(s).
``````
``````x
``````
``````## t24 t33 t34 t10  t9 t25 t26 t21 t22  t1 t39 t40 t20 t28 t29 t31 t32 t19
## "c" "c" "c" "c" "c" "a" "a" "b" "c" "b" "b" "b" "b" "c" "b" "b" "b" "b"
## t13 t23 t30 t35 t36 t27 t14 t15  t3  t6  t7  t2 t17 t18  t8  t4  t5 t37
## "b" "a" "a" "a" "a" "a" "a" "a" "a" "c" "a" "a" "a" "a" "a" "a" "a" "c"
## t38 t16 t11 t12
## "c" "b" "a" "a"
``````

Next, here are some simple functions that we'll use later to compute the AIC scores and the Akaike weights for each model:

``````aic<-function(logL,k) 2*k-2*logL
aic.w<-function(aic){
d.aic<-aic-min(aic)
exp(-1/2*d.aic)/sum(exp(-1/2*d.aic))
}
``````

Now, we need to fit & (more importantly) obtain the log-likelihood of each fitted model. We could actually do this with multiple functions in R (`ace`, `fitDiscrete`, & `optim.pml`, among others); however here I will use `make.simmap`, which actually computes the likelihood using code modified from `ace`.

``````## compute log likelihoods
logL<-sapply(c("ER","SYM","ARD"),
function(model,tree,x) make.simmap(tree,x,model)\$logL,
tree=tree,x=x)
``````
``````## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
``````
``````##            a          b          c
## a -0.7389255  0.3694628  0.3694628
## b  0.3694628 -0.7389255  0.3694628
## c  0.3694628  0.3694628 -0.7389255
``````
``````## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
``````
``````##         a         b         c
## 0.3333333 0.3333333 0.3333333
``````
``````## Done.
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
``````
``````##            a         b          c
## a -0.5566061  0.000000  0.5566061
## b  0.0000000 -1.158184  1.1581840
## c  0.5566061  1.158184 -1.7147901
``````
``````## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
``````
``````##         a         b         c
## 0.3333333 0.3333333 0.3333333
``````
``````## Done.
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
``````
``````##            a           b          c
## a -0.5197351  0.03634396  0.4833911
## b  0.0000000 -0.65052948  0.6505295
## c  1.4205225  0.95975112 -2.3802736
``````
``````## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
``````
``````##         a         b         c
## 0.3333333 0.3333333 0.3333333
``````
``````## Done.
``````
``````logL
``````
``````##        ER       SYM       ARD
## -28.55567 -26.84571 -26.38334
``````
``````## now let's compute AIC values & weights
## (in a real study, we might use AICc)
AIC<-mapply(aic,logL,c(1,3,6))
AIC
``````
``````##       ER      SYM      ARD
## 59.11134 59.69141 64.76668
``````
``````AIC.W<-aic.w(AIC)
AIC.W
``````
``````##         ER        SYM        ARD
## 0.55328515 0.41398768 0.03272717
``````

In our next step, we can “normalize” our weights to the number of simulations we want to use for stochastic mapping. The trick here is to remember that if we just round the product of the weights (which are currently normalized to sum to one) × the desired total number of simulations, we might end up with a total number of simulations less than or greater than our desired number. In the code below I (arbitrarily) add (or subtract) the deficit (or surplus) to models chosen at random; but this need not be our strategy.

``````## now let's normalize this to our number of stochastic
## mapping simulations
nsim<-1000
Nsim<-round(nsim*AIC.W)
d<-if(sum(Nsim)>nsim) -1 else 1
nsim<-Nsim+d*sample(c(rep(1,abs(nsim-sum(Nsim))),
rep(0,length(Nsim)-abs(nsim-sum(Nsim)))))
nsim
``````
``````##  ER SYM ARD
## 553 414  33
``````

Finally, I will perform `nsim` stochastic mapping simulations for each model & then combine stochastic maps conducted across all of our models into one object of class `"multiPhylo"`:

``````## remove any with nsim==0
nsim<-nsim[nsim!=0]
trees<-list()
class(trees)<-"multiPhylo"
for(i in 1:length(nsim)){
obj<-make.simmap(tree,x,model=names(nsim)[i],nsim=nsim[i])
## we could also use:
# obj<-make.simmap(tree,x,model=names(nsim)[i],nsim=nsim[i],Q="mcmc")
## but it would be slower
if(nsim[i]==1){
obj<-list(obj)
class(obj)<-"multiPhylo"
}
trees<-c(trees,obj)
}
``````
``````## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
``````
``````##            a          b          c
## a -0.7389255  0.3694628  0.3694628
## b  0.3694628 -0.7389255  0.3694628
## c  0.3694628  0.3694628 -0.7389255
``````
``````## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
``````
``````##         a         b         c
## 0.3333333 0.3333333 0.3333333
``````
``````## Done.
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
``````
``````##            a         b          c
## a -0.5566061  0.000000  0.5566061
## b  0.0000000 -1.158184  1.1581840
## c  0.5566061  1.158184 -1.7147901
``````
``````## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
``````
``````##         a         b         c
## 0.3333333 0.3333333 0.3333333
``````
``````## Done.
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
``````
``````##            a           b          c
## a -0.5197351  0.03634396  0.4833911
## b  0.0000000 -0.65052948  0.6505295
## c  1.4205225  0.95975112 -2.3802736
``````
``````## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
``````
``````##         a         b         c
## 0.3333333 0.3333333 0.3333333
``````
``````## Done.
``````

With this object, it is possible to do any of the other standard type of things, such as use `describe.simmap` to summarize the results of our analysis, including posterior probabilities at nodes, etc. For instance:

``````obj<-describe.simmap(trees)
obj
``````
``````## 1000 trees with a mapped discrete character with states:
##  a, b, c
##
## trees have 11.26 changes between states on average
##
## changes are of the following types:
##        a,b   a,c   b,a   b,c   c,a   c,b
## x->y 1.238 3.593 0.254 1.816 1.667 2.692
##
## mean total time spent in each state is:
##              a         b         c    total
## raw  8.5813823 2.6669467 2.6141229 13.86245
## prop 0.6190378 0.1923864 0.1885758  1.00000
``````
``````plot(obj)
`````` and so on.

That's all for now on this topic.

1. I fixed one small typo in a previous version of this post. The line:

obj<-describe.simmap(trees)

was previously incorrectly given as:

obj<-describe.simmap(obj)

2. Very useful informations for starter. Waiting for the next.
Coursework Help Service
Thanks

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.