Wednesday, May 8, 2024

Fitting and analyzing the threshold model in a phylogenetic comparative context with likelihood using the discrete approximation

A few months ago I posted about the most important phylogenetic comparative methods paper you’ve probably never read. In that post, I described the clever discrete approximation of the likelihood function for bounded Brownian motion developed by Boucher & DĂ©mery (2016) and anticipated it might have some other equally-interesting applications.

One that I thought of was as a method to compute the (asymptotically) exact likelihood under the threshold model of Felsenstein (2005, 2012). Under the threshold model, our observed character is discrete – but is underlain by a continuously-valued trait (liability) whose value relative to one or more “thresholds” determines which discrete character level manifests in each terminus of the phylogeny. An illustration of evolution under the threshold model is given below.

library(phytools)
layout(matrix(c(1,2),1,2),widths=c(0.55,0.45))
par(mar=c(5.1,4.1,1.1,1.1),las=1,cex.axis=0.8)
obj<-bmPlot(pbtree(b=0.03,n=40,t=100,type="discrete",quiet=TRUE),
type="threshold",thresholds=0,anc=0.5,sig2=1/100,
ngen=100,colors=hcl.colors(2),return.tree=TRUE,
bty="n")
plot(obj\$tree,colors=setNames(hcl.colors(2),c("a","b")),
direction="upwards",ftype="off",mar=c(5.1,1.1,1.1,1.1))

In the original article by Felsenstein (2005), Joe remarks that in order to compute the probability of the observed (discrete data) we must approximate “a high-dimensional integral for which no closed-form formula exists…. [t]he integral is approximated by Monte Carlo integration—drawing a large sample of points from the domain of integration and evaluating the contribution to the integral in the vicinity of these points by evaluating the function at the points…. we replace the continuous function that is being integrated by a histogram that approximates it.” He proceeds on to describe how this can be done using MCMC.

Based on Boucher & DĂ©mery (2016), it occurred to me that this could also be done in a different way – by replacing our underlying continuous process with a discrete one that converges on the continuous one as our step size shrinks.

Here is my function to do this using phytools::fitMk internally, as follows. So far, it can only handle a two-level threshold character (one threshold), but I’ve designed it to update to two or more thresholds (three or more character levels) in the future.

## compute trace (used internally)
trace<-function(X) sum(diag(X))

## fit threshold model using discrete approximation
fitThresh<-function(tree,x,...){
if(hasArg(levs)) levs<-list(...)\$levs
else levs<-100
if(hasArg(root)) root<-list(...)\$root
else root<-"fitzjohn"
C<-vcv(tree)
N<-Ntip(tree)
Ed<-trace(C)/N-sum(C)/(N^2)
lims<-qnorm(c(0.005,0.995),sd=sqrt(Ed))
liability<-seq(lims[1],lims[2],length.out=levs)
if(!is.factor(x)) x<-setNames(as.factor(x),names(x))
X<-to.matrix(x,levels(x))
Y<-matrix(0,N,levs,dimnames=list(rownames(X),
round(liability,3)))
Y[,liability<0]<-X[,1]
Y[,liability>=0]<-X[,2]
q<-1/(2*(diff(lims)/levs)^2)
model<-matrix(0,levs,levs,
dimnames=list(colnames(Y),colnames(Y)))
ind<-cbind(1:(nrow(model)-1),2:nrow(model))
model[ind]<-1
model[ind[,2:1]]<-1
fit<-fitMk(tree,Y,model=model,lik.func="parallel",
expm.method="R_Eigen",q.init=q,min.q=q,max.q=q,
pi=root)
object<-list(sigsq=1.0,bounds=lims,ncat=levs,
liability=liability,threshold=0.0,
logLik=logLik(fit),tree=tree,data=X,mk_fit=fit)
class(object)<-c("fitThresh","fitMk")
object
}

## print for "fitThresh" object
print.fitThresh<-function(x, digits=6, ...){
cat("Object of class \"fitThresh\".\n")
cat("    Set value of sigsq (of the liability) = 1.0\n")
cat(paste("    Set or estimated threshold(s) = [",
paste(x\$threshold,collapse=","),"]\n"))
cat(paste("    Log-likelihood:", round(x\$logLik, digits),
"\n\n"))
}

## logLik for "fitThresh" object
logLik.fitThresh<-function(object,...) object\$logLik

## marginal ancestral states of "fitThresh" object
ancr.fitThresh<-function(object,...){
anc_mk<-ancr(object\$mk_fit,type="marginal")
anc<-matrix(0,object\$tree\$Nnode,2,
dimnames=list(1:object\$tree\$Nnode+Ntip(object\$tree),
colnames(object\$data)))
for(i in 1:length(object\$threshold)){
for(j in 1:nrow(anc)){
anc[j,1]<-sum(anc_mk\$ace[j,
object\$liability<object\$threshold[i]])
}
}
anc[,ncol(anc)]<-1-rowSums(anc)
result<-list(ace=anc,logLik=logLik(object\$mk_fit))
attr(result,"type")<-"marginal"
attr(result,"tree")<-object\$tree
attr(result,"data")<-object\$data
class(result)<-"ancr"
result
}

Let’s try it out.

First, I’m going to simulate a character under the threshold model. We can do that just by generating liabilities using Brownian motion & then “thresholding” them.

tree<-pbtree(n=200,scale=1)
x<-fastBM(tree,a=0.5,internal=TRUE)
thresh<-function(x) if(x<0) "a" else "b"
y<-sapply(x,thresh)
y_obs<-y[tree\$tip.label]
##  t14  t22  t76  t77 t134 t199 t200  t88   t3 t165 t166 t109 t119 t120  t27  t28 t128 t129 t127
##  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"
## t197 t198  t75 t125 t126  t98 t157 t158  t70  t71  t20  t15  t54  t55 t153 t154  t36 t172 t193
##  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "a"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"
## t194  t51  t37  t60  t61 t101 t102  t23  t79 t187 t188 t191 t192 t130 t131 t147 t148 t161 t162
##  "b"  "b"  "b"  "b"  "b"  "b"  "a"  "b"  "a"  "a"  "b"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"
##  t38 t105 t106
##  "a"  "a"  "b"
thresh_fit<-fitThresh(tree,y_obs)
thresh_fit
## Object of class "fitThresh".
##     Set value of sigsq (of the liability) = 1.0
##     Set or estimated threshold(s) = [ 0 ]
##     Log-likelihood: -94.720874
mk_fit<-fitMk(tree,y_obs,model="ARD")
mk_fit
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
##           a         b
## a -2.517655  2.517655
## b  0.700836 -0.700836
##
## Fitted (or set) value of pi:
##   a   b
## 0.5 0.5
## due to treating the root prior as (a) flat.
##
## Log-likelihood: -96.948999
##
## Optimization method used was "nlminb"
##
## R thinks it has found the ML solution.
anova(mk_fit,thresh_fit)
##               log(L) d.f.      AIC     weight
## mk_fit     -96.94900    2 197.8980 0.03812094
## thresh_fit -94.72087    1 191.4417 0.96187906

It looks like our data are more consistent with this model than with, say, and “ARD” Mk model – which makes sense, as this was the generating model for simulation!

Lastly, let’s compare the accuracy of ancestral states between the threshold model & our Mk model.

anc_thresh<-ancr(thresh_fit)
anc_thresh
## Marginal ancestral state estimates:
##            a        b
## 201 0.025311 0.974689
## 202 0.056100 0.943900
## 203 0.005494 0.994506
## 204 0.004201 0.995799
## 205 0.001300 0.998700
## 206 0.000894 0.999106
## ...
##
## Log-likelihood = -94.720874
anc_mk<-ancr(mk_fit)
anc_mk
## Marginal ancestral state estimates:
##            a        b
## 201 0.589915 0.410085
## 202 0.551008 0.448992
## 203 0.297238 0.702762
## 204 0.292654 0.707346
## 205 0.105045 0.894955
## 206 0.061611 0.938389
## ...
##
## Log-likelihood = -96.948999

We might measure the accuracy as the mean squared difference between a binary matrix giving the true states and our marginal reconstruction. The closer this value is to zero, the more accurate our reconstruction.

true_states<-to.matrix(y[1:tree\$Nnode+Ntip(tree)],c("a","b"))
##     a b
## 201 0 1
## 202 0 1
## 203 0 1
## 204 0 1
## 205 0 1
## 206 0 1
thresh_accuracy<-mean((true_states-anc_thresh\$ace)^2)
thresh_accuracy
## [1] 0.09300212
mk_accuracy<-mean((true_states-anc_mk\$ace)^2)
mk_accuracy
## [1] 0.1315431

Neat. This is just one simulation, but I’ve done it a few times & the improvement in accuracy seems remarkably consistent.

Finally, let’s plot our reconstruction.

plot(anc_thresh,direction="upwards",ftype="off",
args.nodelabels=list(piecol=hcl.colors(2),cex=0.3),
args.tiplabels=list(piecol=hcl.colors(2)))

That’s all, folks!