## Monday, May 20, 2024

### A bit more about fitting the multi-state threshold model using the discrete approximation

Just the other day, I posted about fitting a multi-state threshold model using the discrete approximation of Boucher & DÃ©mery (2016). This has now been added to the development version of phytools which can be installed from GitHub using (e.g.) the remotes package.

I thought it might be worth following up with a short exploration of some of the statistical properties of this implementation. In particular, I wondered if it might be interesting to ask: (1) can we distinguish evolution under the threshold model from (say) evolution under an ordered Mk process (and vice versa)?; (2) since this model is inherently ordered, can we identify the correct character order from our data?; and, finally (3) does using this model (compared to, for instance, an ordered Mk model) lead to better prediction (say, of ancestral states)? I’m hoping that the answer to each of these is “yes” – but as I write this, I’ve not yet tested any of these questions!

Before we begin, as a quick reminder, the threshold model is one in which the observed trait is discretely valued, but it’s underlain by an unobserved continuous character called ‘liability.’ Every time that this unobserved trait liability crosses a certain predetermined (but unknown) threshold, the discrete character changes state. Evolution under the threshold model on a phylogeny might look more or less as shown below, in which the left panel shows the continuous evolution of liability, while the left panel gives the evolution of our corresponding discrete trait.

library(phytools)
layout(matrix(c(1,2),1,2),widths=c(0.53,0.47))
par(mar=c(5.1,4.1,1.1,1.1),las=1,cex.axis=0.8)
obj<-bmPlot(pbtree(b=0.03,n=100,t=100,
type="discrete",quiet=TRUE),
type="threshold",thresholds=c(0,1.5,2.4),
anc=1.7,sig2=1/100,
ngen=100,colors=hcl.colors(4),return.tree=TRUE,
bty="n")
plot(obj$tree,colors=setNames(hcl.colors(4), c("a","b","c","d")), direction="upwards",ftype="off", mar=c(5.1,1.1,1.1,1.1))  ### (1) Can we identify evolution under the threshold model? To test whether or not we could distinguish evolution under the threshold model from (e.g.) an ordered Mk model, we can begin by simulating a set of threshold traits. The easiest way to do that is just to first simulate a continuous character under Brownian motion (our liabilities; e.g., using phytools::fastBM) and then continue to “thresholding” it – that is, converting each tip value to a threshold discrete character using a set of pre-specified threshold values. Let’s do twenty simulations, each involving a stochastic pure-birth tree of 200 taxa and a single threshold character with four trait levels and thresholds $$[0, 0.8, 2]$$, as well as a random ancestral liability sampled uniformly on the interval $$[-1, 3]$$. ## our threshold function thresh<-function(liability){ foo<-function(x) if(x<0) "a" else if(x>0&&x<0.8) "b" else if(x>0.8&&x<2) "c" else "d" sapply(liability,foo) } ## simulated trees trees<-pbtree(n=200,nsim=20,scale=3) ## simulated liabilities anc<-runif(n=length(trees),min=-1,max=3) liabilities<-mapply(fastBM,tree=trees,a=anc, SIMPLIFY=FALSE) ## threshold traits x1<-lapply(liabilities,thresh)  Now let’s use the new function phytools::fitThresh to fit the threshold model to each of these twenty datasets. To speed this along a teensie bit, I’m going to run these different optimizations in parallel use the cool foreach package. ## load packages library(foreach) library(doParallel) ## set up cluster ncores<-min(length(trees),detectCores()-1) mc<-makeCluster(ncores,type="PSOCK") registerDoParallel(cl=mc) ## run optimizations thresh_fits<-foreach(i=1:length(trees))%dopar%{ phytools::fitThresh(trees[[i]],x1[[i]]) }  Now let’s do the same thing – but this time fitting the Mk model. To give our Mk model the best chance of “beating” the threshold model, I’m going to design it with two different forms: one in which the back & forth rates of change between adjacent levels are symmetric, and a second in which they can be different. sym_model<-matrix(c( 0,1,0,0, 1,0,2,0, 0,2,0,3, 0,0,3,0),4,4) sym_fits<-foreach(i=1:length(trees))%dopar%{ phytools::fitMk(trees[[i]],x1[[i]], model=sym_model,pi="fitzjohn") }  asym_model<-matrix(c( 0,1,0,0, 2,0,3,0, 0,4,0,5, 0,0,6,0),4,4) asym_fits<-foreach(i=1:length(trees))%dopar%{ phytools::fitMk(trees[[i]],x1[[i]], model=asym_model,pi="fitzjohn") }  Alright, now we should be ready to compare among our different models. To do this, I’ll simply run anova on our set of three models for each tree and dataset, and then extract the Akaike model weights for each of the three. invisible(capture.output( aov_thresh<-mapply(anova,thresh_fits,sym_fits,asym_fits, SIMPLIFY=FALSE))) thresh_w<-t(sapply(aov_thresh, function(x) x$weight))
colnames(thresh_w)<-c("threshold","Mk(sym)",
"Mk(asym)")
round(thresh_w,4)

##       threshold Mk(sym) Mk(asym)
##  [1,]    0.3284  0.0000   0.6715
##  [2,]    0.9395  0.0007   0.0598
##  [3,]    0.3042  0.1700   0.5258
##  [4,]    0.9756  0.0140   0.0104
##  [5,]    0.9632  0.0168   0.0201
##  [6,]    0.8434  0.0293   0.1273
##  [7,]    0.8333  0.0015   0.1652
##  [8,]    0.7404  0.0229   0.2367
##  [9,]    0.9664  0.0076   0.0260
## [10,]    0.9332  0.0246   0.0422
## [11,]    0.0138  0.0000   0.9862
## [12,]    0.9978  0.0016   0.0006
## [13,]    0.6799  0.0446   0.2755
## [14,]    0.9340  0.0097   0.0564
## [15,]    0.9240  0.0120   0.0640
## [16,]    0.8817  0.1105   0.0078
## [17,]    0.5735  0.2167   0.2098
## [18,]    0.9373  0.0010   0.0617
## [19,]    0.6550  0.0131   0.3320
## [20,]    0.8158  0.0006   0.1836


Based on this small experiment, it looks like we’re doing pretty good – in the overwhelming majority of our 20 simulations, the threshold model (our generating model, recall) is best-supported.

Just for fun, and outside of our main hypothesis of interest, let’s also measure the relative positions of the thresholds between states. Remember, as I discussed before, the position of one of the threshold is arbitrary – so we can force it to be zero by subtracting the minimum.

Our generating thresholds, don’t forget, were $$[0, 0.8, 2]$$.

est_thresholds<-t(sapply(thresh_fits,function(x)
setNames(x$threshold,c("a<->b","b<->c","c<->d"))- min(x$threshold)))
est_thresholds

##       a<->b     b<->c    c<->d
##  [1,]     0 0.7582723 2.127434
##  [2,]     0 0.8538812 2.312197
##  [3,]     0 0.8572755 1.972028
##  [4,]     0 0.8578116 1.885027
##  [5,]     0 0.8168484 1.882489
##  [6,]     0 0.8111254 2.063263
##  [7,]     0 0.7000879 2.149859
##  [8,]     0 0.7241787 1.959922
##  [9,]     0 0.8824466 1.933995
## [10,]     0 0.7094864 1.944646
## [11,]     0 0.8524495 1.550285
## [12,]     0 0.8588505 2.178525
## [13,]     0 0.8124073 1.777832
## [14,]     0 1.0014537 2.052370
## [15,]     0 0.6933942 1.937796
## [16,]     0 0.7641322 2.007600
## [17,]     0 0.8346238 2.181587
## [18,]     0 1.2150154 2.532655
## [19,]     0 0.7155396 1.822369
## [20,]     0 0.6974469 1.881261


That’s pretty satisfying, I think.

Now let’s try the inverse experiment. In this case, we can simulate under an ordered Mk model (here, I’ll use a symmetric such model – for the best chance of confusing fitThresh) and then estimate under the same three models as earlier.

## set Q for simulation
Q<-matrix(c(
0.0,1.0,0.0,0.0,
1.0,0.0,0.5,0.0,
0.0,0.5,0.0,1.2,
0.0,0.0,1.2,0.0),4,4,byrow=TRUE,
dimnames=list(c("a","b","c","d"),
c("a","b","c","d")))
diag(Q)<--rowSums(Q)
## simulate data
x2<-lapply(trees,sim.Mk,Q=Q)
## fit models
thresh_fits<-foreach(i=1:length(trees))%dopar%{
phytools::fitThresh(trees[[i]],x2[[i]])
}
sym_model<-matrix(c(
0,1,0,0,
1,0,2,0,
0,2,0,3,
0,0,3,0),4,4)
sym_fits<-foreach(i=1:length(trees))%dopar%{
phytools::fitMk(trees[[i]],x2[[i]],
model=sym_model,pi="fitzjohn")
}
asym_model<-matrix(c(
0,1,0,0,
2,0,3,0,
0,4,0,5,
0,0,6,0),4,4)
asym_fits<-foreach(i=1:length(trees))%dopar%{
phytools::fitMk(trees[[i]],x2[[i]],
model=asym_model,pi="fitzjohn")
}


We can then proceed to compare among our different models in much the same way we did before.

invisible(capture.output(
aov_mk<-mapply(anova,thresh_fits,sym_fits,
asym_fits,SIMPLIFY=FALSE)))
thresh_w<-t(sapply(aov_mk,function(x) x$weight)) colnames(thresh_w)<-c("threshold","Mk(sym)", "Mk(asym)") round(thresh_w,4)  ## threshold Mk(sym) Mk(asym) ## [1,] 0.0072 0.9128 0.0800 ## [2,] 0.0000 0.9105 0.0895 ## [3,] 0.0000 0.8446 0.1554 ## [4,] 0.0001 0.9324 0.0676 ## [5,] 0.0009 0.9256 0.0735 ## [6,] 0.0000 0.9455 0.0545 ## [7,] 0.0000 0.9103 0.0896 ## [8,] 0.0000 0.8968 0.1032 ## [9,] 0.0001 0.9159 0.0840 ## [10,] 0.0002 0.9401 0.0597 ## [11,] 0.0002 0.8606 0.1392 ## [12,] 0.0000 0.8263 0.1737 ## [13,] 0.0000 0.8923 0.1077 ## [14,] 0.0000 0.8511 0.1489 ## [15,] 0.0000 0.9138 0.0862 ## [16,] 0.0054 0.9298 0.0647 ## [17,] 0.0000 0.8599 0.1401 ## [18,] 0.0000 0.9470 0.0529 ## [19,] 0.0000 0.8207 0.1792 ## [20,] 0.0000 0.9315 0.0685  Neat. This shows us that in pretty much every instance either the generating model (labeled "Mk(sym)" in the table), or the asymmetric Mk model ("Mk(asym)"), captures almost all the model weight – and the threshold model virtually none. Finally, we shouldn’t forget to stop our cluster so that it doesn’t continue to use up system resources. Let’s do that now. (We’ll open up a new cluster later, but it will have a different number of cores.) ## stop cluster stopCluster(cl=mc)  ### (2) Can we find the correct trait order from our data? Our second question was – since the multi-state threshold model is inherently ordered – can we correctly estimate this order from the data? Even as I explore this, I should also note that it seems likely that under the circumstances in which the threshold model is hypothesized, the trait order would also be known by the investigator. Nonetheless, the question seemed worth asking. To do this, I’ll go back to my first set of simulated data (x1) with true order a <-> b <-> c <-> d, fit each of the 12 possible trait orders of four states (keeping in mind that the sequences [a, b, c, d] and [d, c, b, a] are exactly equivalent). ## get all unique orders levs<-c("a","b","c","d") perms<-gtools::permutations(4,4,levs) for(i in 2:4){ ii<-which(perms[,1]==levs[i]) jj<-which(perms[,4]%in%levs[i:1]) perms<-perms[-intersect(ii,jj),] }  Let’s use parallelization with foreach again, just to cut down on some of the computation time. We’re fitting a lot of models! (12 $$\times$$ 20, in fact.) ## set up cluster ncores<-min(nrow(perms),detectCores()-1) mc<-makeCluster(ncores,type="PSOCK") registerDoParallel(cl=mc) ## fit all 12 models to each dataset all_fits<-list() for(j in 1:length(trees)){ all_fits[[j]]<-foreach(i=1:nrow(perms))%dopar%{ phytools::fitThresh(trees[[j]],x1[[j]], sequence=perms[i,]) } } stopCluster(cl=mc)  To see whether or not we tend to choose the generating model ([a, b, c, d]), let’s compute Akaike weights across each of our set of 12 models for each simulated tree and dataset. weights<-function(x) aic.w(do.call(AIC,x)$AIC)
perm_weights<-t(sapply(all_fits,weights))
colnames(perm_weights)<-apply(perms,1,paste,
collapse=",")
round(perm_weights,3)

##       a,b,c,d a,b,d,c a,c,b,d a,c,d,b a,d,b,c a,d,c,b b,a,c,d b,a,d,c b,c,a,d b,d,a,c c,a,b,d c,b,a,d
##  [1,]   0.117   0.883       0       0       0       0   0.000       0   0.000       0       0       0
##  [2,]   0.856   0.000       0       0       0       0   0.142       0   0.002       0       0       0
##  [3,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
##  [4,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
##  [5,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
##  [6,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
##  [7,]   0.793   0.207       0       0       0       0   0.000       0   0.000       0       0       0
##  [8,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
##  [9,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
## [10,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
## [11,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
## [12,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
## [13,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
## [14,]   0.953   0.000       0       0       0       0   0.000       0   0.047       0       0       0
## [15,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
## [16,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
## [17,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
## [18,]   0.999   0.000       0       0       0       0   0.000       0   0.001       0       0       0
## [19,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0
## [20,]   1.000   0.000       0       0       0       0   0.000       0   0.000       0       0       0


OK – the picture couldn’t really be too much more persuasive than that: in nearly all cases, the generating order ([a, b, c, d]) is favored compared to all other alternatives. Clearly, we can estimate the order of our threshold character trait if it happened to be unknown.

### (3) Does using the threshold model lead to better prediction?

Finally, we asked whether using the threshold model (when true) leads to better overall prediction – in our case, of ancestral states.

To investigate this, I’ll have to generate a new set of simulations because in our previous simulations we didn’t save the ancestral values at internal nodes. Here, I’m simply simulating tip and node liabilities, and then thresholding both to obtain the discrete characters values at both tips and internal nodes.

## our threshold function
thresh<-function(liability){
foo<-function(x) if(x<0) "a" else
if(x>0&&x<0.8) "b" else
if(x>0.8&&x<2) "c" else "d"
sapply(liability,foo)
}
## simulated liabilities
anc<-runif(n=length(trees),min=-1,max=3)
liabilities<-mapply(fastBM,tree=trees,a=anc,
internal=TRUE,SIMPLIFY=FALSE)
## threshold traits
x3_all<-lapply(liabilities,thresh)
x3<-mapply(function(x,t) x[t$tip.label],x3_all,trees, SIMPLIFY=FALSE)  To measure accuracy, I’ll now compute the mean squared difference between a binary representation of the true ancestral states, and our marginal reconstruction obtained using ancr. This measure has the property of going towards 0.0 when we are confidently correct about all states – and 1.0 when precisely the converse is true! anc_accuracy<-function(t,x,truth){ model<-matrix(c( 0,1,0,0, 1,0,2,0, 0,2,0,3, 0,0,3,0),4,4) mk_fit<-phytools::fitMk(t,x,model=model, pi="fitzjohn") mk_anc<-phytools::ancr(mk_fit) th_fit<-phytools::fitThresh(t,x) th_anc<-phytools::ancr(th_fit) truth<-phytools::to.matrix(truth, c("a","b","c","d")) setNames(c(mean((mk_anc$ace-truth)^2),
mean((th_anc$ace-truth)^2)), c("Mk","threshold")) }  x3_nodes<-mapply(function(t,x) x[1:Nnode(t)+Ntip(t)], t=trees,x=x3_all,SIMPLIFY=FALSE)  ## set up cluster ncores<-min(length(trees),detectCores()-1) mc<-makeCluster(ncores,type="PSOCK") registerDoParallel(cl=mc) ## compute accuracy Accuracy_thresh<-foreach(i=1:length(trees), .combine="rbind")%dopar%{ anc_accuracy(trees[[i]],x3[[i]], x3_nodes[[i]]) } Accuracy_thresh  ## Mk threshold ## result.1 0.08303919 0.08001440 ## result.2 0.09144049 0.07504564 ## result.3 0.07318017 0.06159575 ## result.4 0.06011879 0.05886326 ## result.5 0.07852081 0.07311473 ## result.6 0.07133398 0.05872539 ## result.7 0.05811679 0.05366009 ## result.8 0.05895149 0.05127617 ## result.9 0.04430315 0.03437346 ## result.10 0.07640328 0.07298250 ## result.11 0.02883655 0.02296869 ## result.12 0.04131325 0.05000457 ## result.13 0.04349993 0.03716758 ## result.14 0.07020538 0.05577830 ## result.15 0.08552298 0.06598748 ## result.16 0.04279199 0.04044518 ## result.17 0.09393248 0.09155821 ## result.18 0.09818441 0.09003129 ## result.19 0.04045814 0.03930287 ## result.20 0.07415361 0.06781260  Looking across our twenty simulations, the general tendency is clear that estimating under the threshold model leads to greater accuracy (i.e., lower values for our measure) of ancestral states. Just as we did previously, let’s also undertake the inverse analysis – simulating under an ordered Mk model, and then estimating both using the threshold model and the generating Mk model. x4_all<-lapply(trees,sim.Mk,Q=Q,internal=TRUE) x4<-mapply(function(x,t) x[t$tip.label],x4_all,trees,
SIMPLIFY=FALSE)
x4_nodes<-mapply(function(t,x) x[1:Nnode(t)+Ntip(t)],
t=trees,x=x4_all,SIMPLIFY=FALSE)
## compute accuracy
Accuracy_mk<-foreach(i=1:length(trees),
.combine="rbind")%dopar%{
anc_accuracy(trees[[i]],x4[[i]],x4_nodes[[i]])
}
Accuracy_mk

##                   Mk  threshold
## result.1  0.09544990 0.10864622
## result.2  0.10911823 0.09741020
## result.3  0.05555238 0.05433832
## result.4  0.07309050 0.07196058
## result.5  0.07386274 0.07274954
## result.6  0.07163266 0.08529978
## result.7  0.09660137 0.09858737
## result.8  0.08574490 0.09449920
## result.9  0.11850211 0.11599140
## result.10 0.08458749 0.08851898
## result.11 0.08366240 0.07809681
## result.12 0.07225887 0.08339927
## result.13 0.05950763 0.07542953
## result.14 0.05242473 0.04677409
## result.15 0.06903981 0.07948580
## result.16 0.07768083 0.08501524
## result.17 0.06795849 0.08488134
## result.18 0.05634197 0.06919138
## result.19 0.11093754 0.10862184
## result.20 0.06982347 0.07322193


In this case, although the difference is not as stark, we should see that accuracy is slightly better (i.e., our measure is lower) for the Mk model compared to our threshold function. (Actually, this is true of only slightly more than half of our simulations. I’m willing to bet that this difference would only increase for larger trees and for trait evolution models that deviate even more strongly from the threshold model.)

Very neat stuff, IMO!

We can’t forget to stop our cluster.

stopCluster(cl=mc)