Tuesday, April 29, 2025

A new semi-threshold trait evolution model for phytools

For some time, I’ve been fascinated by the usefulness of the discretized diffusion approximation method of Boucher & Démery (2016) to obtain likelihood functions for specifically non-Gaussian evolutionary models for which no closed-form solution for the probability density under the model is available.

I can’t get into all of it here, but for those who are interested in getting some more details, I suggest checking out this blog post or my Evolution 2025 meeting talk on YouTube, along with numerous other posts to this blog (you can just search for "Boucher").

Prior uses of this approximation on my blog (and now in phytools) including fitting a multi-state threshold model, fitting discrete character & hidden-state dependent heterogeneous rate continuous character evolution models, and, of course, bounded Brownian motion.

Towards the end of last year, I posted on visualizing evolution under what I referred to as a “semi-threshold” trait evolution model.

Just as a reminder, the threshold model is a discrete character evolution model in which the observed trait is discretely-valued but is underlain by an unobserved continuous character called “liability.” Every time liability crosses some fixed threshold value, our discrete character changes in state.

A semi-threshold model, on the other hand, imagines that on some part of its range liability can be observed as a measurable quantitative trait. On another part of its range, by contrast, liability is not observable but nonetheless continues to evolve.

A simple example of a trait that could conceivably evolve in this manner might be “horn length” which can increase or decrease – but when it shrinks to zero the horn is absent and can no longer be measured. The ability to produce a horn (our “liability” trait, in this case), however, might continue to evolve even if the horn is no longer present and measurable. (That is, liability could continue to evolve away from simple absence, and thus make horns increasingly difficult to re-evolve.)

To get a sense of evolution under our “semi-threshold” model I created the visual simulation below (adapted from my previous post).

## load packages
library(phytools)
## set number of taxa
N<-80
## simulate a tree
tree<-pbtree(n=N,scale=100)
tree<-untangle(ladderize(tree),"read.tree")
## add a lot of non-branching nodes along each edge
tt<-map.to.singleton(make.era.map(tree,seq(0,100,by=1)))
## simulate liability (x)
x<-fastBM(tt,sig2=1.25,internal=TRUE,a=5)
## map the presence/absence of x as >0 or <0
maps<-list()
for(i in 1:nrow(tt$edge)){
  if(x[tt$edge[i,1]]<0 && x[tt$edge[i,2]]<0) 
    maps[[i]]<-setNames(tt$edge.length[i],"0")
  else if(x[tt$edge[i,1]]>0 && x[tt$edge[i,2]]>0) 
    maps[[i]]<-setNames(tt$edge.length[i],"1")
  else {
    d<-x[tt$edge[i,2]]-x[tt$edge[i,1]]
    if(d>0)
      maps[[i]]<-setNames(abs(c(-x[tt$edge[i,1]]/d,
        x[tt$edge[i,2]]/d))*tt$edge.length[i],
        0:1)
    else
      maps[[i]]<-setNames(abs(c(x[tt$edge[i,1]]/d,
        x[tt$edge[i,2]]/d))*tt$edge.length[i],
        1:0)
  }
}
tt$maps<-maps
class(tt)<-c("simmap","phylo")
## create phenogram plot
par(mar=c(5.1,5.1,1.1,1.1),mfrow=c(2,1))
tmp<-phenogram(tt,x,
  lty=setNames(c("solid","solid"),0:1),
  colors=setNames(c("black","blue"),0:1),
  lty=setNames(c("dotted","solid"),0:1),
  xlim=c(-5,110),ylim=rep(max(abs(x)),2)*c(-1,1),
  ylab="liability / phenotype",las=1,cex.axis=0.9,
  ftype="off")
lines(c(0,100),c(0,0))
## label the threshold
text(0,-0.06*max(abs(x)),"threshold",cex=0.8,pos=4,font=3)
## label the trait ranges
text(x=-7,y=mean(c(0,max(abs(x)))),"trait present",srt=90,
  cex=0.8,font=3)
lines(c(-3,-5,-5,-3),c(0.5,0.5,max(abs(x)),max(abs(x))))
text(x=-7,y=mean(c(0,-max(abs(x)))),"trait absent",srt=90,
  cex=0.8,font=3)
lines(c(-3,-5,-5,-3),c(-0.5,-0.5,-max(abs(x)),-max(abs(x))))
mtext("a)",line=0,adj=-0.1)
## create contMap plot
obj<-contMap(tt,x[1:Ntip(tt)],anc.states=x[1:tt$Nnode+Ntip(tt)],
  lims=rep(max(abs(x)),2)*c(-1,1),plot=FALSE,method="user")
## re-color so that any value of trait < 0 is grey
obj<-setMap(obj,c(rep("black",500),
  colorRampPalette(c("white","blue"))(500)))
## plot contMap
plot(obj,mar=c(1.1,5.1,0.1,1.1),legend=FALSE,
  xlim=c(-5,110),ftype="off",ylim=c(-0.2*N,N),
  lwd=3,outline=FALSE)
segments(x0=rep(100,Ntip(tt))+1,x1=tmp[,"x"],
  y0=1:Ntip(obj$tree),y1=1:Ntip(obj$tree),lty="dotted")
text(obj$tree$tip.label,x=tmp[,"x"],y=1:Ntip(obj$tree),
  pos=4,cex=0.6)
## add legend
add.color.bar(60,cols=colorRampPalette(c("white","blue"))(100),
  prompt=FALSE,lims=c(0,max(abs(x))),x=0,y=-0.1*N,
  title="quantitative trait value (if present)",subtitle="")
lines(x=c(0,10),y=rep(-0.2*N,2),lwd=6,lend=4)
lines(x=c(0,10),y=rep(-0.2*N,2),lwd=4,lend=4,col="black")
text(x=11,y=-0.2*N,"trait absent",pos=4)
mtext("b)",line=0,adj=-0.1)

plot of chunk unnamed-chunk-104

This visualization shows the evolution of normally unobserved liability (when below the threshold) and our observed continuous trait (when above it), in panel (a); and evolution of our semi-threshold trait mapped onto the branches of the tree, in panel (b). Once again, we could imagine that our trait is the horn – which can have a length if present, but for which we imagine that the underlying continuous basis may continue to change even if the trait is no longer produced!

For good measure, here’s how our semi-threshold trait is encoded in this case. Here, all values below zero (in this case) for the liability character are just set equal to zero, and our trait has been “semi-thresholded.”

## semi-threshold trait
y<-x[tree$tip.label]
y[y<0]<-0
print(y,digits=2)
##   t63   t64   t38   t24   t34   t35   t10   t52   t53   t18    t8   t58   t59   t45 
## 10.49 10.88 13.02  1.52 18.95  8.16  0.69 19.47 13.86 22.08 15.20  6.76  7.77  0.00 
##   t11    t6   t65   t66   t47   t39   t77   t78   t37   t79   t80   t36   t32   t31 
##  4.09  9.78 11.79  6.75  7.49 10.75 10.88 11.05  6.85  0.00  0.00  0.00  0.00  0.00 
##    t5   t42   t43   t71   t72   t15   t16   t27   t28   t23   t12    t2    t3    t4 
##  0.00  2.73  0.00  0.00  0.00  0.00  0.00  5.38 15.72  3.08 11.26 22.19 12.90 19.65 
##   t56   t57   t41    t1   t69   t70   t60   t33   t48   t49   t19   t20   t29   t30 
##  0.00  0.00  0.00  9.80 11.10  9.20  4.42  0.00  0.00  1.25  0.00  0.14  1.42  0.00 
##   t73   t74   t40   t22   t75   t76   t13   t14   t50   t51   t46   t17   t67   t68 
## 10.39  9.10  3.93 13.86 16.66 18.08 23.18 25.47  3.84  9.87  1.88  8.05 22.46 21.85 
##   t21    t7   t61   t62   t44   t25   t26   t54   t55    t9 
## 16.65 19.59 17.60 21.46 19.02  4.07  2.45 10.90 10.71  0.00

Today, I implemented a prototype fitSemiThresh function to fit the semi-threshold model to data using the discretized diffusion approximation to compute the likelihood, as follows.

fitSemiThresh<-function(tree,x,threshold=c(0,1),...){
  if(hasArg(trace)) trace<-list(...)$trace
  else trace<-0
  if(hasArg(levs)) levs<-list(...)$levs
  else levs<-200
  if(hasArg(root)) root=list(...)$root
  else root<-"mle"
  if(root=="nuisance") pi<-"fitzjohn"
  else if(root=="mle") pi<-"mle"
  else if(root=="flat") pi<-rep(1/levs,levs)
  if(hasArg(lik.func)) lik.func<-list(...)$lik.func
  else lik.func<-"pruning"
  if(hasArg(expm.method)) expm.method<-list(...)$expm.method
  else expm.method<-"R_Eigen"
  if(hasArg(p)) p<-list(...)$p
  else p<-4
  C<-vcv(tree)
  N<-Ntip(tree)
  ## continuous character
  x<-x[tree$tip.label]
  if(is.null(threshold)){
    threshold<-range(x)
    LIMS<-phytools:::expand.range(threshold,p=p)
  } else if(threshold[1]==-Inf&&threshold[2]==Inf){
    LIMS<-phytools:::expand.range(range(x),p=p)
    threshold<-LIMS
  } else if(threshold[2]==Inf){
    LIMS<-phytools:::expand.range(c(threshold[1],max(x)),p=p)
    threshold[2]<-LIMS[2]
  } else if(threshold[1]==-Inf){
    LIMS<-phytools:::expand.range(c(min[x],threshold[2]),p=p)
    threshold[1]<-LIMS[1]
  } else {
    LIMS<-phytools:::expand.range(threshold,p=p)
  }
  dd<-diff(LIMS)
  tol<-1e-8*dd/levs
  n2<-round(diff(threshold)/diff(LIMS)*levs)+1
  bb<-seq(threshold[1]+tol,threshold[2]-tol,length.out=n2)
  w<-diff(range(bb))/(n2-1)
  n1<-if(LIMS[1]<threshold[1]) 
    round((threshold[1]-LIMS[1])/diff(LIMS)*levs) else 0
  n3<-levs-n2-n1+1
  bb<-c(seq(to=threshold[1]-w,by=w,length.out=n1),
    bb,
    seq(from=threshold[2]+w,by=w,length.out=n3))
  bins<-cbind(bb[1:levs],bb[2:(levs+1)])
  X<-phytools:::to_binned(x,bins)
  ll<-which(bins[,1]<=threshold[1])
  uu<-which(bins[,2]>=threshold[2])
  for(i in 1:length(x)){
    if(x[i]<=threshold[1]) X[i,ll]<-1
    else if(x[i]>=threshold[2]) X[i,uu]<-1
  }
  MODEL<-matrix(0,ncol(X),ncol(X),
    dimnames=list(colnames(X),colnames(X)))
  for(i in 1:(nrow(MODEL)-1)) MODEL[i,i+1]<-MODEL[i+1,i]<-1
  pic_x<-pic(x,multi2di(tree))
  nn<-max(c(floor(0.2*multi2di(tree)$Nnode),1))
  q.init<-runif(n=1,0,2)*
    (1/2)*mean(sort(pic_x^2,decreasing=TRUE)[1:nn])*
    (levs/dd)^2
  if(hasArg(max.sigsq)) max.sigsq<-list(...)$max.sigsq
  else max.sigsq<-2*mean(sort(pic_x^2,decreasing=TRUE)[1:nn])
  max.q=(1/2)*max.sigsq*(levs/dd)^2
  mk_fit<-fitMk(tree,X,model=MODEL,pi=pi,lik.func=lik.func,
    expm.method=expm.method,logscale=TRUE,max.q=max.q,
    q.init=q.init)
  lik<-logLik(mk_fit)-Ntip(tree)*log(dd/levs)
  x0<-sum(mk_fit$pi*rowMeans(bins))
  sigsq<-2*mk_fit$rates*(dd/levs)^2
  object<-list(
    sigsq=sigsq,
    x0=x0,
    threshold=threshold,
    ncat=levs,
    logLik=lik,
    opt_results=mk_fit$opt_results,
    at_bounds=(sigsq>=(max.sigsq-tol)),
    mk_fit=mk_fit)
  class(object)<-"fitSemiThresh"
  object
}

print.fitSemiThresh<-function(x,digits=6,...){
  cat(paste("Object of class \"fitSemiThresh\" based on\n",
    "	a discretization with k =",
    x$ncat,"levels.\n"))
  cat(paste("Set or estimated thresholds: [",round(x$threshold[1],digits),
    ",",round(x$threshold[2],digits),"]\n\n"))
  cat("Fitted model parameters:\n")
  if(is.null(x$CI)) cat(paste("  sigsq:",round(x$sigsq,6),"\n"))
  else cat(paste("  sigsq:",round(x$sigsq,6),"  [",round(x$CI[1],4),
    ",",round(x$CI[2],4),"]\n"))
  cat(paste("     x0:",round(x$x0,6),"\n\n"))
  cat(paste("Log-likelihood:",round(x$logLik,digits),"\n\n"))
  if(x$opt_results$convergence == 0 && !x$at_bounds) 
    cat("R thinks it has found the ML solution.\n\n")
  else if(x$at_bounds) cat("Optimization may be at bounds.\n\n")
  else cat("R thinks optimization may not have converged.\n\n")
}

Let’s try it!

fitted_model<-fitSemiThresh(tree,y,threshold=c(0,Inf))
fitted_model
## Object of class "fitSemiThresh" based on
##  	a discretization with k = 200 levels.
## Set or estimated thresholds: [ 0 , 63.683218 ]
## 
## Fitted model parameters:
##   sigsq: 1.175643 
##      x0: 7.387253 
## 
## Log-likelihood: -194.625829 
## 
## R thinks it has found the ML solution.

Cool. These parameter estimates actually come out very close to our generating values of \(\sigma^2 = 1.25\) and \(x_0 = 5\).

We can contrast that with what we might obtain using geiger::fitContinuous under an assumption of simple, unbounded & un-thresholded Brownian motion. Ignoring the threshold, we’ll tend to underestimate \(\sigma^2\) compared to its generating value.

geiger::fitContinuous(tree,y)
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
## 	sigsq = 0.777791
## 	z0 = 8.761141
## 
##  model summary:
## 	log-likelihood = -241.213382
## 	AIC = 486.426764
## 	AICc = 486.582608
## 	free parameters = 2
## 
## Convergence diagnostics:
## 	optimization iterations = 100
## 	failed iterations = 0
## 	number of iterations with same best fit = 100
## 	frequency of best fit = 1.000
## 
##  object summary:
## 	'lik' -- likelihood function
## 	'bnd' -- bounds for likelihood search
## 	'res' -- optimization iteration summary
## 	'opt' -- maximum likelihood parameter estimates

It gets even worse when there are thresholds at the upper and lower boundaries of the traits.

Let’s create a trait this property using \(\sigma^2 = 0.3\), \(x_0 = 0.5\), and thresholds at \([0,1]\)

N<-200
tree<-pbtree(n=N,scale=1)
x<-fastBM(tree,a=0.5,sig2=0.3)
x[x<0]<-0
x[x>1]<-1
hist(x,
  main="distribution of semi-thresholded trait values",
  font.main=3,
  xlab="trait",breaks=seq(-0.025,1.025,by=0.05))

plot of chunk unnamed-chunk-110

fitted_model<-fitSemiThresh(tree,x,threshold=c(0,1),
  levs=200)
fitted_model
## Object of class "fitSemiThresh" based on
##  	a discretization with k = 200 levels.
## Set or estimated thresholds: [ 0 , 1 ]
## 
## Fitted model parameters:
##   sigsq: 0.285624 
##      x0: 0.61 
## 
## Log-likelihood: 117.02295 
## 
## R thinks it has found the ML solution.

Once again, we can contrast this with geiger::fitContinuous treating the character as if it was evolving without bounds.

geiger::fitContinuous(tree,x)
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
## 	sigsq = 0.210468
## 	z0 = 0.596543
## 
##  model summary:
## 	log-likelihood = 32.199314
## 	AIC = -60.398629
## 	AICc = -60.337715
## 	free parameters = 2
## 
## Convergence diagnostics:
## 	optimization iterations = 100
## 	failed iterations = 0
## 	number of iterations with same best fit = 100
## 	frequency of best fit = 1.000
## 
##  object summary:
## 	'lik' -- likelihood function
## 	'bnd' -- bounds for likelihood search
## 	'res' -- optimization iteration summary
## 	'opt' -- maximum likelihood parameter estimates

It’s an important point to make that this model has standard Brownian motion as a special case but in which the “thresholds” are set to \([-\infty, \infty]\).

Let me show that quickly here whilst increasing levs to levs=400….

x<-fastBM(tree<-pbtree(n=100))
fitSemiThresh(tree,x,threshold=c(-Inf,Inf),levs=400)
## Object of class "fitSemiThresh" based on
##  	a discretization with k = 400 levels.
## Set or estimated thresholds: [ -18.135385 , 20.253821 ]
## 
## Fitted model parameters:
##   sigsq: 0.790013 
##      x0: 1.58707 
## 
## Log-likelihood: -121.809603 
## 
## R thinks it has found the ML solution.
geiger::fitContinuous(tree,x)
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
## 	sigsq = 0.780540
## 	z0 = 1.564920
## 
##  model summary:
## 	log-likelihood = -121.335715
## 	AIC = 246.671429
## 	AICc = 246.795141
## 	free parameters = 2
## 
## Convergence diagnostics:
## 	optimization iterations = 100
## 	failed iterations = 0
## 	number of iterations with same best fit = 100
## 	frequency of best fit = 1.000
## 
##  object summary:
## 	'lik' -- likelihood function
## 	'bnd' -- bounds for likelihood search
## 	'res' -- optimization iteration summary
## 	'opt' -- maximum likelihood parameter estimates

As always under the discretized diffusion approximation, this similarity should go up for increasing values of levs, but at what can be very significant computational cost!

Finally, just as with bounded_bm, fitThresh will always tend to explain our data better than unbounded Brownian motion for thresholds (or bounds) set to the observable limits of our trait.

That’s because the effect is to “fold” the tail of the probability density back in at the edges, making our most extreme observations more probable.

As such, this model should be fit to data when we have genuine reason to suspect that our trait is evolving under a semi-threshold property – either on logical grounds, or based on the distribution of observed values among our tip species. In the latter case it could then be competed against alternative models with similar among-species trait value distribution, such as the reflective bounded model or a continuous character model with sticky bounds (not yet implemented, but not hard to imagine).

Look for more on this soon!