Friday, November 28, 2025

Uncertain input data for hidden-rates model in fitHRM

A couple of days ago a phytools user contacted me reporting that he was unable to incorporate tip uncertainty in the standard way when fitting a hidden-rates model using fitHRM.

He’s not mistaken: I had not (previously) added that feature to fitHRM. However, since fitHRM just uses fitMk internally, it has always been possible to fit a hidden-rates model with tip-state uncertainty in phytools by creating the model design matrix manually & fitting the model using fitMk.

That said, this seemed like a useful, and relatively easy, feature to add to the function – so I did it the same afternoon. Let’s test it using a quick simulation….

library(phytools)
packageVersion("phytools")
## [1] '2.5.3'

I’ll start by simulating a single tree with 1,000 taxa. Hidden-rates models are pretty data-hungry, so it won’t hurt to start with a large tree!

tree<-pbtree(n=1000,scale=1)
tree
## 
## Phylogenetic tree with 1000 tips and 999 internal nodes.
## 
## Tip labels:
##   t672, t911, t912, t749, t667, t668, ...
## 
## Rooted; includes branch length(s).

Next, I’ll generate data for a binary trait with a hidden attribute (with levels, let’s say, "+" and "-") that affects the rate of transition between the two levels of our observed character.

levs<-c("0+","0-","1+","1-")
Q<-matrix(c(
  -1.25,0.25,1,0,
  0.25,-0.35,0,0.1,
  1,0,-1.25,0.25,
  0,0.1,0.25,-0.35),4,4,byrow=TRUE,
  dimnames=list(levs,levs))
Q
##       0+    0-    1+    1-
## 0+ -1.25  0.25  1.00  0.00
## 0-  0.25 -0.35  0.00  0.10
## 1+  1.00  0.00 -1.25  0.25
## 1-  0.00  0.10  0.25 -0.35

According to this simulation design, character levels "0+" and "1+" have a high rate of transition in the observed character, while character levels "0-" and "1-" have a low evolutionary rate.

Now simulate:

xx<-sim.Mk(tree,Q)
head(xx,20)
## t672 t911 t912 t749 t667 t668 t183 t563 t564 t233 t234 t305 t306 t678 t805 t806 t216 t307 t308 
##   1-   1-   1-   1-   1-   1-   1-   1-   1-   1-   1-   1-   1-   1-   1-   1-   1-   1-   1- 
## t200 
##   1- 
## Levels: 0- 0+ 1- 1+

So far so good.

Next, we can merge "0-" and "0+" into the state "0" as well as "1-" and "1+" into the state "1" as follows.

x<-vector()
x[grep("0",xx)]<-"0"
x[grep("1",xx)]<-"1"
names(x)<-names(xx)
x<-as.factor(x)
head(x,20)
## t672 t911 t912 t749 t667 t668 t183 t563 t564 t233 t234 t305 t306 t678 t805 t806 t216 t307 t308 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## t200 
##    1 
## Levels: 0 1

Now, let’s fit our model using this factor vector as input without any uncertainty in our trait.

hrm_fit1<-fitHRM(tree,x,model="ER",ncat=2,
  niter=5,parallel=FALSE)
## 
## This is the design matrix of the fitted model.
## Does it make sense?
## 
##    0 0* 1 1*
## 0  0  3 1  0
## 0* 3  0 0  2
## 1  1  0 0  3
## 1* 0  2 3  0
## 
## log-likelihood from current iteration: -167.4157 
##  --- Best log-likelihood so far: -167.4157 ---
## log-likelihood from current iteration: -167.4157 
##  --- Best log-likelihood so far: -167.4157 ---
## log-likelihood from current iteration: -167.4157 
##  --- Best log-likelihood so far: -167.4157 ---
## log-likelihood from current iteration: -167.4157 
##  --- Best log-likelihood so far: -167.4157 ---
## log-likelihood from current iteration: -167.4157 
##  --- Best log-likelihood so far: -167.4157 ---

By setting model="ER" I assume that the transition rates between "0" and "1" depend only on the hidden state, not that all my transition rates are equal. (That wouldn’t really make sense!)

Here’s our fitted model:

hrm_fit1
## Object of class "fitHRM".
## 
## Observed states: [ 0, 1 ]
## Number of rate categories per state: [ 2, 2 ]
## 
## Fitted (or set) value of Q:
##            0        0*         1        1*
## 0  -0.306601  0.199011  0.107590  0.000000
## 0*  0.199011 -1.356206  0.000000  1.157195
## 1   0.107590  0.000000 -0.306601  0.199011
## 1*  0.000000  1.157195  0.199011 -1.356206
## 
## Fitted (or set) value of pi:
##    0   0*    1   1* 
## 0.25 0.25 0.25 0.25 
## due to treating the root prior as (a) flat.
## 
## Log-likelihood: -167.415684 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.
plot(hrm_fit1,mar=rep(0,4))

Plot of the fitted model, hrm_fit1.
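As an aside, the fitMk workaround mentioned at the start of this post would look something like the following sketch (using the design matrix fitHRM just printed above; the object names XX and mk_fit are mine, and I’m coding each observed state as ambiguous between its two hidden levels, which is how fitMk handles tip ambiguity). It should recover essentially the same fitted model.

```r
## sketch: the same hidden-rates model via fitMk & a manual design matrix
## (assumes tree & the binary factor x from above)
levs<-c("0","0*","1","1*")
## design (index) matrix matching the one printed by fitHRM
model<-matrix(c(
  0,3,1,0,
  3,0,0,2,
  1,0,0,3,
  0,2,3,0),4,4,byrow=TRUE,
  dimnames=list(levs,levs))
## expand the observed states: observed "0" is ambiguous between the
## hidden levels "0" & "0*", & likewise for observed "1"
XX<-matrix(0,length(x),4,dimnames=list(names(x),levs))
XX[x=="0",c("0","0*")]<-1
XX[x=="1",c("1","1*")]<-1
mk_fit<-fitMk(tree,XX,model=model)
```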

Now let’s re-code our trait as a binary matrix, which should now be supported as a valid input format for fitHRM.

X<-to.matrix(x,levels(x))
head(X,20)
##      0 1
## t672 0 1
## t911 0 1
## t912 0 1
## t749 0 1
## t667 0 1
## t668 0 1
## t183 0 1
## t563 0 1
## t564 0 1
## t233 0 1
## t234 0 1
## t305 0 1
## t306 0 1
## t678 0 1
## t805 0 1
## t806 0 1
## t216 0 1
## t307 0 1
## t308 0 1
## t200 0 1

This is just a re-formatting of the same input data we provided before, so our fitted model should be exactly the same.

hrm_fit2<-fitHRM(tree,X,model="ER",ncat=2,
  niter=5,parallel=FALSE)
## 
## This is the design matrix of the fitted model.
## Does it make sense?
## 
##    0 0* 1 1*
## 0  0  3 1  0
## 0* 3  0 0  2
## 1  1  0 0  3
## 1* 0  2 3  0
## 
## log-likelihood from current iteration: -167.4157 
##  --- Best log-likelihood so far: -167.4157 ---
## log-likelihood from current iteration: -167.4157 
##  --- Best log-likelihood so far: -167.4157 ---
## log-likelihood from current iteration: -167.4157 
##  --- Best log-likelihood so far: -167.4157 ---
## log-likelihood from current iteration: -167.4157 
##  --- Best log-likelihood so far: -167.4157 ---
## log-likelihood from current iteration: -167.4157 
##  --- Best log-likelihood so far: -167.4157 ---
hrm_fit2
## Object of class "fitHRM".
## 
## Observed states: [ 0, 1 ]
## Number of rate categories per state: [ 2, 2 ]
## 
## Fitted (or set) value of Q:
##            0        0*         1        1*
## 0  -1.356206  0.199011  1.157195  0.000000
## 0*  0.199011 -0.306601  0.000000  0.107590
## 1   1.157195  0.000000 -1.356206  0.199011
## 1*  0.000000  0.107590  0.199011 -0.306601
## 
## Fitted (or set) value of pi:
##    0   0*    1   1* 
## 0.25 0.25 0.25 0.25 
## due to treating the root prior as (a) flat.
## 
## Log-likelihood: -167.415684 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

So, that worked exactly as expected – which is good.

Finally, let’s introduce some random uncertainty into our data. In this case, our fitted model should be similar to the one we obtained with the “real” data – but slightly different.

To do this, I’m going to sample 40 tips at random and substitute [0.8,0.2] for [1,0] or [0.2,0.8] for [0,1]. Make sense?

This is a bit clunky, but will do the trick!

ii<-sample(1:nrow(X),40)
for(i in 1:length(ii)){
  if(X[ii[i],1]==1) X[ii[i],]<-c(0.8,0.2)
  else X[ii[i],]<-c(0.2,0.8)
}
head(X,40)
##        0   1
## t672 0.0 1.0
## t911 0.0 1.0
## t912 0.0 1.0
## t749 0.0 1.0
## t667 0.0 1.0
## t668 0.0 1.0
## t183 0.0 1.0
## t563 0.0 1.0
## t564 0.0 1.0
## t233 0.0 1.0
## t234 0.0 1.0
## t305 0.0 1.0
## t306 0.0 1.0
## t678 0.0 1.0
## t805 0.0 1.0
## t806 0.0 1.0
## t216 0.0 1.0
## t307 0.0 1.0
## t308 0.0 1.0
## t200 0.0 1.0
## t55  0.0 1.0
## t20  0.0 1.0
## t452 0.0 1.0
## t777 0.2 0.8
## t778 0.0 1.0
## t401 0.0 1.0
## t558 1.0 0.0
## t559 1.0 0.0
## t96  0.0 1.0
## t753 0.0 1.0
## t754 0.0 1.0
## t76  0.0 1.0
## t77  0.0 1.0
## t41  0.0 1.0
## t602 1.0 0.0
## t603 1.0 0.0
## t384 1.0 0.0
## t634 0.0 1.0
## t635 0.0 1.0
## t383 0.0 1.0

Great. Now let’s run fitHRM on this input data with our manufactured uncertainty.

hrm_fit3<-fitHRM(tree,X,model="ER",ncat=2,
  niter=5,parallel=FALSE)
## 
## This is the design matrix of the fitted model.
## Does it make sense?
## 
##    0 0* 1 1*
## 0  0  3 1  0
## 0* 3  0 0  2
## 1  1  0 0  3
## 1* 0  2 3  0
## 
## log-likelihood from current iteration: -182.3416 
##  --- Best log-likelihood so far: -182.3416 ---
## log-likelihood from current iteration: -174.2253 
##  --- Best log-likelihood so far: -174.2253 ---
## log-likelihood from current iteration: -174.2253 
##  --- Best log-likelihood so far: -174.2253 ---
## log-likelihood from current iteration: -174.2253 
##  --- Best log-likelihood so far: -174.2253 ---
## log-likelihood from current iteration: -183.0348 
##  --- Best log-likelihood so far: -174.2253 ---
hrm_fit3
## Object of class "fitHRM".
## 
## Observed states: [ 0, 1 ]
## Number of rate categories per state: [ 2, 2 ]
## 
## Fitted (or set) value of Q:
##            0        0*         1        1*
## 0  -1.305223  0.206871  1.098352  0.000000
## 0*  0.206871 -0.315629  0.000000  0.108758
## 1   1.098352  0.000000 -1.305223  0.206871
## 1*  0.000000  0.108758  0.206871 -0.315629
## 
## Fitted (or set) value of pi:
##    0   0*    1   1* 
## 0.25 0.25 0.25 0.25 
## due to treating the root prior as (a) flat.
## 
## Log-likelihood: -174.225268 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.
plot(hrm_fit3,mar=rep(0,4))

Plot of the fitted model, hrm_fit3.

Works.

I feel pretty good that I’ve implemented this correctly & without errors, but it’s a new feature for this particular function, so users should be appropriately cautious and run their own tests!

Lastly, just as a reminder, the hidden-rates model is also implemented in the powerful corHMM package.
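For readers who want to try that route, a minimal corHMM call might look something like the sketch below. (This is based on my reading of corHMM’s documentation: it expects a data frame with species names in the first column & the character in the second, and rate.cat sets the number of hidden rate categories. The argument names and defaults may differ across corHMM versions, so check ?corHMM before relying on this.)

```r
## sketch: an analogous hidden-rates fit in corHMM
## (assumes tree & the binary factor x from above)
library(corHMM)
## corHMM wants species in column 1, character in column 2
dat<-data.frame(species=names(x),trait=as.character(x))
corhmm_fit<-corHMM(phy=tree,data=dat,rate.cat=2,model="ER")
```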

Happy Thanksgiving to those who celebrate it.