## Friday, August 16, 2024

### Ancestral state estimates from the multi-state threshold model fit using the discrete approximation (compared to `ancThresh`)

In some recent posts to this blog (e.g., 1, 2) I’ve demonstrated marginal ancestral state estimation under the multi-state threshold model using the discrete approximation of Boucher & DÃ©mery (2016). (To learn more about the discrete approximation more generally, as well as Boucher & DÃ©mery’s fascinating paper, I recommend checking out my Evolution Meeting 2024 talk, which is available on YouTube.)

Readers of this blog, however, might also be familiar with a 2014 Evolution article by me in which I describe ancestral state reconstruction under the threshold model using Bayesian MCMC. The method described in this 2014 article is implemented in the phytools function `ancThresh` and has been used numerous times in published articles. In my article, I describe using MCMC to jointly sample ancestral & tip liabilities, and thresholds between discrete character levels, from their posterior probability distribution.

A relevant question is, of course, how do these two different approaches compare? Do they produce the same set of marginal node states?

Fortunately, the answer is more or less: yes – though, of course, we must ensure that our MCMC converges on the posterior distribution, which can be quite slow, depending on the nature of our data and the size of the tree. I would be inclined to recommend using `fitThresh` and `ancr` (for the numerous advantages so doing offers, e.g., as discussed here); however, we don’t need to be overly concerned about past results obtained using `ancThresh`, so long as convergence to the posterior was realized.

``````library(phytools)
``````

Now, I’ll proceed to simulate a tree a character evolving under the threshold model, as follows. I’m keeping my tree small – just so I can run a fairly large number of generations of MCMC (107, in fact) without needing to dedicate many weeks to this post!

``````tree<-pbtree(n=40,scale=1)
liability<-fastBM(tree,a=1.5)
thresholds<-setNames(c(0,1,Inf),
c("a","b","c"))
x<-as.factor(threshState(liability,thresholds))
``````

Our tree is a standard `"phylo"` object with 40 tips.

``````tree
``````
``````##
## Phylogenetic tree with 40 tips and 39 internal nodes.
##
## Tip labels:
##   t13, t14, t6, t7, t19, t20, ...
##
## Rooted; includes branch lengths.
``````

While our character data is in a factor vector with names that correspond to our species labels.

``````x
``````
``````## t13 t14  t6  t7 t19 t20 t12 t29 t30 t39 t40 t36 t21 t27 t28 t22 t34 t35 t25 t26 t24 t23  t9
##   a   a   a   a   b   b   b   c   c   c   b   b   c   b   b   b   a   a   a   a   a   b   a
##  t4 t15 t16  t8 t32 t33 t31  t1 t10 t11  t2 t17 t18  t3  t5 t37 t38
##   a   b   b   a   c   c   c   c   c   c   c   c   c   c   c   c   c
## Levels: a b c
``````

Next, I’m going to run `ancThresh`. Readers following along may choose to run fewer generations, because this can take a while!

``````anc_mcmc<-ancThresh(tree,x,ngen=1e7,
control=list(print=FALSE))
``````
``````## **** NOTE: no sequence provided, using alphabetical or numerical order
## MCMC starting....
``````
``````anc_mcmc
``````
``````##
## Object containing the results from an MCMC analysis
## of the threshold model using ancThresh.
##
## List with the following components:
## ace:	matrix with posterior probabilities assuming 2e+06
## 	burn-in generations.
## mcmc:	posterior sample of liabilities at tips & internal
## 	nodes (a matrix with 10001 rows & 39 columns).
## par:	posterior sample of the relative positions of the
## 	thresholds, the log-likelihoods, and any other
##  	model variables (a matrix with 10001 rows).
##
## The MCMC was run under the following conditions:
## 	seq = a <-> b <-> c
## 	model = BM
## 	number of generations = 1e+07
## 	sample interval= 1000
## 	burn-in = 2e+06
``````

Next, let’s fit our threshold model under the discrete approximation of Boucher & DÃ©mery (2016) using `phytools::fitThresh`. This still takes a second for `levs=400` (we could’ve used `levs=200` to no great ill effect), but it is much, much quicker than `ancThresh`.

``````thresh_mle<-fitThresh(tree,x,levs=400,
root="fitzjohn")
thresh_mle
``````
``````## Object of class "fitThresh".
##
##     Set value of sigsq (of the liability) = 1.0
##
## 	  Set or estimated threshold(s) =  [ -0.717763, 0.038735 ]*
##
##     Log-likelihood: -26.370495
##
## (*lowermost threshold is fixed)
``````

Then, to obtain marginal ancestral states, all we need to do is call the `ancr` S3 method on our fitted model object (`thresh_mle`) as follows.

``````anc_mle<-ancr(thresh_mle)
anc_mle
``````
``````## Marginal ancestral state estimates:
##           a        b        c
## 41 0.003175 0.232452 0.764373
## 42 0.445464 0.542706 0.011830
## 43 0.915251 0.084664 0.000084
## 44 0.452606 0.537200 0.010193
## 45 0.383487 0.602590 0.013924
## 46 0.747646 0.250047 0.002308
## ...
##
## Log-likelihood = -26.370495
``````

Let’s make a plot comparing the two different reconstructions:

``````par(mfrow=c(1,2))
cols<-colorRampPalette(c("blue","yellow"))(3)
plot(anc_mcmc,node.cex=0.8,tip.cex=0.6,ftype="off",
mar=c(0.1,0.1,2.1,0.1))
plotTree(tree,ylim=c(-0.1*Ntip(tree),Ntip(tree)),
ftype="off",offset=0.5,lwd=1,
mar=c(0.1,0.1,2.1,0.1))
nodelabels(pie=anc_mle\$ace,cex=0.8,piecol=cols)
tiplabels(pie=to.matrix(x,levels(x)),cex=0.6,
piecol=cols)
mtext("b) estimates from fitThresh > ancr",line=0,
``````

This gives the impression that our estimates are very nearly, if not precisely, identical. If anything, the global root states would seem to be most different. (This may have something to do with our choice of root prior. I’m not sure!)

Let’s plot them together & find out.

``````par(mar=c(5.1,4.1,1.1,1.1))
plot(anc_mle\$ace,anc_mcmc\$ace,bty="n",pch=21,cex=1.5,
bg=make.transparent("blue",0.25),las=1,
xlab="marginal estimates from fitThresh > ancr",
ylab="marginal estiates from ancThresh MCMC",
cex.axis=0.7,cex.lab=0.8)
abline(a=0,b=1)
grid()
``````

We can see that our estimates fall quite close to a 1:1 line. The few that fall off it are actually the global root.

To see that, we can plot it again with the root node values shown in red.

``````par(mar=c(5.1,4.1,1.1,1.1))
plot(anc_mle\$ace[2:tree\$Nnode,],anc_mcmc\$ace[2:tree\$Nnode,],
bty="n",pch=21,cex=1.5,
bg=make.transparent("blue",0.25),las=1,
xlab="marginal estimates from fitThresh > ancr",
ylab="marginal estiates from ancThresh MCMC",
cex.axis=0.7,cex.lab=0.8)
points(anc_mle\$ace[1,],anc_mcmc\$ace[1,],
pch=21,cex=1.5,bg=make.transparent("red",0.25))
abline(a=0,b=1)
grid()
``````

That's very close -- and, of course, we shouldn't expect them to be precisely the same because the former estimates come from Bayesian MCMC and our default prior is probably not totally uninformative (that's invariably hard to guarantee).

OK, thhat’s it for now folks!