A few months ago I posted about the most important phylogenetic comparative methods paper you’ve probably never read. In that post, I described the clever discrete approximation of the likelihood function for bounded Brownian motion developed by Boucher & DĂ©mery (2016) and anticipated it might have some other equally-interesting applications.
One that I thought of was as a method to compute the (asymptotically) exact likelihood under the threshold model of Felsenstein (2005, 2012). Under the threshold model, our observed character is discrete – but is underlain by a continuously-valued trait (liability) whose value relative to one or more “thresholds” determines which discrete character level manifests in each terminus of the phylogeny. An illustration of evolution under the threshold model is given below.
library(phytools)
layout(matrix(c(1,2),1,2),widths=c(0.55,0.45))
par(mar=c(5.1,4.1,1.1,1.1),las=1,cex.axis=0.8)
obj<-bmPlot(pbtree(b=0.03,n=40,t=100,type="discrete",quiet=TRUE),
type="threshold",thresholds=0,anc=0.5,sig2=1/100,
ngen=100,colors=hcl.colors(2),return.tree=TRUE,
bty="n")
plot(obj$tree,colors=setNames(hcl.colors(2),c("a","b")),
direction="upwards",ftype="off",mar=c(5.1,1.1,1.1,1.1))
In the original article by Felsenstein (2005), Joe remarks that in order to compute the probability of the observed (discrete data) we must approximate “a high-dimensional integral for which no closed-form formula exists…. [t]he integral is approximated by Monte Carlo integration—drawing a large sample of points from the domain of integration and evaluating the contribution to the integral in the vicinity of these points by evaluating the function at the points…. we replace the continuous function that is being integrated by a histogram that approximates it.” He proceeds on to describe how this can be done using MCMC.
Based on Boucher & DĂ©mery (2016), it occurred to me that this could also be done in a different way – by replacing our underlying continuous process with a discrete one that converges on the continuous one as our step size shrinks.
Here is my function to do this using phytools::fitMk
internally, as follows. So far, it can only handle a two-level threshold character (one threshold), but I’ve designed it to update to two or more thresholds (three or more character levels) in the future.
## compute trace (used internally)
trace<-function(X) sum(diag(X))
## fit threshold model using discrete approximation
fitThresh<-function(tree,x,...){
if(hasArg(levs)) levs<-list(...)$levs
else levs<-100
if(hasArg(root)) root<-list(...)$root
else root<-"fitzjohn"
C<-vcv(tree)
N<-Ntip(tree)
Ed<-trace(C)/N-sum(C)/(N^2)
lims<-qnorm(c(0.005,0.995),sd=sqrt(Ed))
liability<-seq(lims[1],lims[2],length.out=levs)
if(!is.factor(x)) x<-setNames(as.factor(x),names(x))
X<-to.matrix(x,levels(x))
Y<-matrix(0,N,levs,dimnames=list(rownames(X),
round(liability,3)))
Y[,liability<0]<-X[,1]
Y[,liability>=0]<-X[,2]
q<-1/(2*(diff(lims)/levs)^2)
model<-matrix(0,levs,levs,
dimnames=list(colnames(Y),colnames(Y)))
ind<-cbind(1:(nrow(model)-1),2:nrow(model))
model[ind]<-1
model[ind[,2:1]]<-1
fit<-fitMk(tree,Y,model=model,lik.func="parallel",
expm.method="R_Eigen",q.init=q,min.q=q,max.q=q,
pi=root)
object<-list(sigsq=1.0,bounds=lims,ncat=levs,
liability=liability,threshold=0.0,
logLik=logLik(fit),tree=tree,data=X,mk_fit=fit)
class(object)<-c("fitThresh","fitMk")
object
}
## print for "fitThresh" object
print.fitThresh<-function(x, digits=6, ...){
cat("Object of class \"fitThresh\".\n")
cat(" Set value of sigsq (of the liability) = 1.0\n")
cat(paste(" Set or estimated threshold(s) = [",
paste(x$threshold,collapse=","),"]\n"))
cat(paste(" Log-likelihood:", round(x$logLik, digits),
"\n\n"))
}
## logLik for "fitThresh" object
logLik.fitThresh<-function(object,...) object$logLik
## marginal ancestral states of "fitThresh" object
ancr.fitThresh<-function(object,...){
anc_mk<-ancr(object$mk_fit,type="marginal")
anc<-matrix(0,object$tree$Nnode,2,
dimnames=list(1:object$tree$Nnode+Ntip(object$tree),
colnames(object$data)))
for(i in 1:length(object$threshold)){
for(j in 1:nrow(anc)){
anc[j,1]<-sum(anc_mk$ace[j,
object$liability<object$threshold[i]])
}
}
anc[,ncol(anc)]<-1-rowSums(anc)
result<-list(ace=anc,logLik=logLik(object$mk_fit))
attr(result,"type")<-"marginal"
attr(result,"tree")<-object$tree
attr(result,"data")<-object$data
class(result)<-"ancr"
result
}
Let’s try it out.
First, I’m going to simulate a character under the threshold model. We can do that just by generating liabilities using Brownian motion & then “thresholding” them.
tree<-pbtree(n=200,scale=1)
x<-fastBM(tree,a=0.5,internal=TRUE)
thresh<-function(x) if(x<0) "a" else "b"
y<-sapply(x,thresh)
y_obs<-y[tree$tip.label]
head(y_obs,60)
## t14 t22 t76 t77 t134 t199 t200 t88 t3 t165 t166 t109 t119 t120 t27 t28 t128 t129 t127
## "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## t197 t198 t75 t125 t126 t98 t157 t158 t70 t71 t20 t15 t54 t55 t153 t154 t36 t172 t193
## "b" "b" "b" "b" "b" "b" "b" "a" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b" "b"
## t194 t51 t37 t60 t61 t101 t102 t23 t79 t187 t188 t191 t192 t130 t131 t147 t148 t161 t162
## "b" "b" "b" "b" "b" "b" "a" "b" "a" "a" "b" "a" "a" "a" "a" "a" "a" "a" "a"
## t38 t105 t106
## "a" "a" "b"
thresh_fit<-fitThresh(tree,y_obs)
thresh_fit
## Object of class "fitThresh".
## Set value of sigsq (of the liability) = 1.0
## Set or estimated threshold(s) = [ 0 ]
## Log-likelihood: -94.720874
mk_fit<-fitMk(tree,y_obs,model="ARD")
mk_fit
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## a b
## a -2.517655 2.517655
## b 0.700836 -0.700836
##
## Fitted (or set) value of pi:
## a b
## 0.5 0.5
## due to treating the root prior as (a) flat.
##
## Log-likelihood: -96.948999
##
## Optimization method used was "nlminb"
##
## R thinks it has found the ML solution.
anova(mk_fit,thresh_fit)
## log(L) d.f. AIC weight
## mk_fit -96.94900 2 197.8980 0.03812094
## thresh_fit -94.72087 1 191.4417 0.96187906
It looks like our data are more consistent with this model than with, say, and “ARD” Mk model – which makes sense, as this was the generating model for simulation!
Lastly, let’s compare the accuracy of ancestral states between the threshold model & our Mk model.
anc_thresh<-ancr(thresh_fit)
anc_thresh
## Marginal ancestral state estimates:
## a b
## 201 0.025311 0.974689
## 202 0.056100 0.943900
## 203 0.005494 0.994506
## 204 0.004201 0.995799
## 205 0.001300 0.998700
## 206 0.000894 0.999106
## ...
##
## Log-likelihood = -94.720874
anc_mk<-ancr(mk_fit)
anc_mk
## Marginal ancestral state estimates:
## a b
## 201 0.589915 0.410085
## 202 0.551008 0.448992
## 203 0.297238 0.702762
## 204 0.292654 0.707346
## 205 0.105045 0.894955
## 206 0.061611 0.938389
## ...
##
## Log-likelihood = -96.948999
We might measure the accuracy as the mean squared difference between a binary matrix giving the true states and our marginal reconstruction. The closer this value is to zero, the more accurate our reconstruction.
true_states<-to.matrix(y[1:tree$Nnode+Ntip(tree)],c("a","b"))
head(true_states)
## a b
## 201 0 1
## 202 0 1
## 203 0 1
## 204 0 1
## 205 0 1
## 206 0 1
thresh_accuracy<-mean((true_states-anc_thresh$ace)^2)
thresh_accuracy
## [1] 0.09300212
mk_accuracy<-mean((true_states-anc_mk$ace)^2)
mk_accuracy
## [1] 0.1315431
Neat. This is just one simulation, but I’ve done it a few times & the improvement in accuracy seems remarkably consistent.
Finally, let’s plot our reconstruction.
plot(anc_thresh,direction="upwards",ftype="off",
args.nodelabels=list(piecol=hcl.colors(2),cex=0.3),
args.tiplabels=list(piecol=hcl.colors(2)))
That’s all, folks!
No comments:
Post a Comment
Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.