Thursday, August 15, 2024

Fitting a multi-state threshold model using fitThresh in phytools when some tip states are uncertain

Yesterday a phytools user asked if it was possible to supply input data for fitThresh as a matrix of (prior) probabilities that each tip was in each state of the character. fitThresh, for those who don’t recall, is a function to fit the multi-state threshold model using the discrete approximation of Boucher & DÃ©mery (2016). (If this is new to you, check out my Evolution Meeting 2024 talk entitled Bounded Brownian motion & the most important phylogenetic comparative methods paper you’ve probably never read to learn more. It’s on YouTube!)

As a reminder, and to make a nice graphic, evolution under the threshold model looks something like this:

library(phytools)
layout(matrix(c(1,2),1,2),widths=c(0.55,0.45))
par(mar=c(5.1,4.1,1.1,1.1),las=1,cex.axis=0.8,bg="#2f2f2f",
fg="white",col.axis="white",col.lab="white")
obj<-bmPlot(pbtree(b=0.03,n=80,t=100,type="discrete",
quiet=TRUE),type="threshold",thresholds=c(0,0.5,2,2.5),
anc=1,sig2=1/100,ngen=100,colors=hcl.colors(5,
palette="plasma"),return.tree=TRUE,bty="n")
plot(obj$tree,colors=setNames(hcl.colors(5,palette="plasma"), c("a","b","c","d","e")),direction="upwards",ftype="off", mar=c(5.1,1.1,1.1,1.1))  The answer, at the time, was no; however, I have sinced pushed two updates to the function (1, 2) that fix this. Followers of this blog might recall from an earlier post that there are effectively two ways to handle tip state uncertainty in discrete character phylogenetic models. In my previous post, I characterized these as follows: (1) as if the uncertainty is observed (e.g., we see that the base at a particular nucleotide site is a purine, but we can’t tell whether it’s an adenine or a guanine); or (2) as if the uncertainty is unobserved (e.g., we see a 0 for one of our species, but there’s, say, a 10% chance that we might’ve measured wrongly & it’s actually a 1). In the former case, the total probability of our data is equal to the sum of the probabilities of the data given that the ambiguous site is an adenine or a guanine. In the latter, we evaluate the product of the probability of our observed data and the chances that the possibly erroneously measured datum was 0 (0.9) and add to it the probability of our data but in which the ambiguous datum is 1, multiplied by the chances that it is 1 (0.1). This all sounds a bit convoluted, but as I describe in my previous post, it makes less of a difference than one might think, so I’d argue that it’s probably most sensible to take the approach that best reflects the type of uncertainty we have in our data! It was a tiny bit complicated (for reasons that I won’t get into), but I’ve now implemented a similar approach for fitThresh – that is, users can supply their input in the form of an $$N \times k$$ matrix (for N species and k levels of the discrete trait), in which ambiguity is coded either as 1.0s for all possible states (i.e., approach #1, above) and/or by indicating the prior probability that each tip is in each of the possible states of the trait (i.e., approach #2). To illustrate this, I’m going to begin by simulating data under the threshold model. Remember, this simply involves generating liability values (the unobserved, underlying continuous trait) first, say, under a model of Brownian motion evolution – and then converting the simulated species value of the trait to our discrete threshold character (which can be done using the mostly internally-used function, phytools::threshState). tree<-pbtree(n=300,scale=1) liability<-fastBM(tree,a=1.5) thresholds<-setNames(c(0,0.5,2,2.5,Inf), c("a","b","c","d","e")) x<-as.factor(threshState(liability,thresholds)) head(x,10)  ## t222 t223 t107 t299 t300 t120 t159 t160 t110 t111 ## c c c c c c b a a a ## Levels: a b c d e  Now, let’s proceed to fit our threshold model simply by passing tree and our factor x directly to fitThresh. (This is the only form that data for fitThresh could assume before today’s updates.) fit1<-fitThresh(tree,x) fit1  ## Object of class "fitThresh". ## ## Set value of sigsq (of the liability) = 1.0 ## ## Set or estimated threshold(s) = ## [ -1.402909, -0.893271, 0.633884, 1.245576 ]* ## ## Log-likelihood: -224.719748 ## ## (*lowermost threshold is fixed)  Keeping in mind that the lowermost threshold is arbitrary, if we reset it to 0.0 (our simulated lower threshold) we should find that the estimate thresholds are quite accurate compared to their generating values. ## generating thresholds  ## a b c d e ## 0.0 0.5 2.0 2.5 Inf  ## estimated fit1$threshold-min(fit1$threshold)  ## [1] 0.0000000 0.5096379 2.0367932 2.6484848  (This is nothing new, of course.) Next, I’m going to convert the vector x to a simple binary matrix & show that the exact same result will be obtained (excepting for differences in numerical optimization). X<-to.matrix(x,levels(x)) head(X,10)  ## a b c d e ## t222 0 0 1 0 0 ## t223 0 0 1 0 0 ## t107 0 0 1 0 0 ## t299 0 0 1 0 0 ## t300 0 0 1 0 0 ## t120 0 0 1 0 0 ## t159 0 1 0 0 0 ## t160 1 0 0 0 0 ## t110 1 0 0 0 0 ## t111 1 0 0 0 0  fit2<-fitThresh(tree,X) fit2  ## Object of class "fitThresh". ## ## Set value of sigsq (of the liability) = 1.0 ## ## Set or estimated threshold(s) = ## [ -1.402909, -0.893012, 0.634471, 1.257801 ]* ## ## Log-likelihood: -224.721515 ## ## (*lowermost threshold is fixed)  So far, so good. Next, to emulate uncertainty in some tip states, I’ll pick a random set of 30 tips (10% of the taxa in my tree), and set them to complete ambiguity. To do that, I’ll assign the corresponding rows a set of five 1.0s across the five possible values of the trait. X[ii<-sample(1:nrow(X),30),]<-rep(1,5) head(X,10)  ## a b c d e ## t222 0 0 1 0 0 ## t223 0 0 1 0 0 ## t107 0 0 1 0 0 ## t299 0 0 1 0 0 ## t300 0 0 1 0 0 ## t120 0 0 1 0 0 ## t159 1 1 1 1 1 ## t160 1 0 0 0 0 ## t110 1 0 0 0 0 ## t111 1 0 0 0 0  Now let’s fit our model with these data & see how the fitted model compares. In this case, since they are based on less data, we might expect our parameter estimates to be slightly less accurate than for the full dataset (although, on average, they should remain unbiased – something that will be impossible to evaluate from a single simulation). fit3<-fitThresh(tree,X) fit3  ## Object of class "fitThresh". ## ## Set value of sigsq (of the liability) = 1.0 ## ## Set or estimated threshold(s) = ## [ -1.402909, -0.874436, 0.634451, 1.22128 ]* ## ## Log-likelihood: -205.881759 ## ## (*lowermost threshold is fixed)  Out of curiousity, let’s check how well we’d do in predicting the values for are now “missing” terminal taxa based on our fitted model. To do that, I’ll run ancr with tips=TRUE on our fitted model and then compute the mean square product of our original data and the reconstructed tip values, subsampling to include only those set as ambiguous in our analysis. anc_fit3<-ancr(fit3,tips=TRUE) anc_fit3  ## Marginal ancestral state estimates: ## a b c d e ## 301 0.004163 0.279245 0.716592 0.000000 0e+00 ## 302 0.135957 0.656351 0.207692 0.000000 0e+00 ## 303 0.097724 0.488095 0.414179 0.000002 0e+00 ## 304 0.101190 0.479178 0.419629 0.000003 0e+00 ## 305 0.002069 0.080525 0.913791 0.003608 8e-06 ## 306 0.000000 0.018144 0.976987 0.004869 0e+00 ## ... ## ## Log-likelihood = -205.881759  sum(anc_fit3$ace[ii,]*
to.matrix(x,levels(x))[ii,])/length(ii)

## [1] 0.6294646


This is a crude measure of accuracy, but shows that our marginal reconstructions are about 63% “correct” – meaning we might have gotten around ⅔ of them correct with 100% confidence, all of them right with 63% confidence, or, most likely, something in between. This is pretty good, given that we gave the method no information about the tip states, the trait changes rapidly in some parts of the tree, and we’d expect to get these values right only 20% of the time on average by random chance.

Lastly, let’s emulate “partial” uncertainty. In this case, I’ll go through the same 30 randomly selected tips, and each time assign an equal prior probability that the tip is in the known, true condition – or in one of its up to two neighboring values.

X<-to.matrix(x,levels(x))
for(i in 1:length(ii)){
jj<-which(X[ii[i],]==1)
ind<-c(jj,sample(jj+c(-1,1),1))
while(any(ind<1)||any(ind>5))
ind<-c(jj,sample(jj+c(-1,1),1))
X[ii[i],ind]<-c(0.5,0.5)
}


Let’s proceed to fit the model. In this case, we should expect our result to even more closely resemble fit1 and fit2 from the full data because we are only partially (rather than fully) ambiguating our 30 termini.

fit4<-fitThresh(tree,X)
fit4

## Object of class "fitThresh".
##
##     Set value of sigsq (of the liability) = 1.0
##
## 	  Set or estimated threshold(s) =
##          [ -1.402909, -0.863932, 0.657881, 1.268888 ]*
##
##     Log-likelihood: -233.248294
##
## (*lowermost threshold is fixed)


Let’s once again measure the predictive value of this model by seeing how it allows ancr to predict the values of our ambiguous tip states.

As before, we shouldn’t be surprised to find that this prediction is more accurate than from our fit3 model object because our tip states were less ambiguous in estimation.

anc_fit4<-ancr(fit4,tips=TRUE)
sum(anc_fit4$ace[ii,]* to.matrix(x,levels(x))[ii,])/length(ii)  ## [1] 0.7410549  Finally, let’s go back to the same dataset we used for fit4, and disambiguate it further still be assigning the known, true state a prior probability of 0.75 and the wrong state a probability of 0.25. A quick trick to do this using our previously calculated matrix, X, is as follows. X[ii,]<-X[ii,]/2+to.matrix(x,levels(x))[ii,]/2 head(X,20)  ## a b c d e ## t222 0 0.00 1.00 0 0 ## t223 0 0.00 1.00 0 0 ## t107 0 0.00 1.00 0 0 ## t299 0 0.00 1.00 0 0 ## t300 0 0.00 1.00 0 0 ## t120 0 0.00 1.00 0 0 ## t159 0 0.75 0.25 0 0 ## t160 1 0.00 0.00 0 0 ## t110 1 0.00 0.00 0 0 ## t111 1 0.00 0.00 0 0 ## t31 0 0.00 1.00 0 0 ## t251 0 0.00 1.00 0 0 ## t252 0 0.00 1.00 0 0 ## t100 0 0.00 1.00 0 0 ## t249 0 1.00 0.00 0 0 ## t250 0 1.00 0.00 0 0 ## t118 0 0.00 1.00 0 0 ## t119 0 0.00 1.00 0 0 ## t8 1 0.00 0.00 0 0 ## t224 1 0.00 0.00 0 0  (Don’t ask me to explain how that works.) OK, now let’s fit our model! fit5<-fitThresh(tree,X) fit5  ## Object of class "fitThresh". ## ## Set value of sigsq (of the liability) = 1.0 ## ## Set or estimated threshold(s) = ## [ -1.402909, -0.868399, 0.657958, 1.26904 ]* ## ## Log-likelihood: -227.355496 ## ## (*lowermost threshold is fixed)  Compare this result to fit1 or fit2. It should be closer on average. Is it? Finally, let’s do the tip prediction thing again: anc_fit5<-ancr(fit5,tips=TRUE) sum(anc_fit5$ace[ii,]*
to.matrix(x,levels(x))[ii,])/length(ii)

## [1] 0.8618047


Readers should keep in mind, of course, that the likelihoods are not comparable across any of these different analyses (apart from the first two) since each one use different data for our species!

That’s all I have to say about this for now, but I hope to be posting more about using the discrete approximation of Boucher & DÃ©mery (2016) to fit interesting models in phylogenetic comparative biology very soon!