Friday, September 25, 2020

New object class and methods for Q transition matrix of discrete character models

As of last weekend, phytools is updated on CRAN.

Since I've had a hard time keeping up with this blog, I'm now trying to (retroactively) document some of the new features that this update contains.

One update is the object class "Qmatrix" and associated S3 methods, including an as.Qmatrix method.

The idea of this class & method is that different functions that fit discrete character models to data (and thus estimate a Q transition matrix between states) return the rates of this estimated matrix in a whole bunch of different ways. The as.Qmatrix method (and "Qmatrix" object class), allows us the to pull out this matrix from our fitted model object in a standardized fashion.

To see how this works.

I'll use a phylogeny of lizards in the family Liolaemidae from South America, along with data for parity mode. The tree and data in this example come from Esquerré et al. (2018).

library(geiger)
library(phytools)
packageVersion("phytools")
## [1] '0.7.70'
Liol.tree<-read.nexus("Liolaemidae.MCC.nex")
Liol.tree
## 
## Phylogenetic tree with 258 tips and 257 internal nodes.
## 
## Tip labels:
##   Ctenoblepharys_adspersa, Liolaemus_abaucan, Liolaemus_koslowskyi, Liolaemus_albiceps, Liolaemus_irregularis, Liolaemus_ornatus, ...
## 
## Rooted; includes branch lengths.
Liol.data<-read.csv("Liolaemidae.data.csv",row.names=1,
    stringsAsFactors=TRUE)
head(Liol.data)
##                         parity_mode max_altitude temperature
## Ctenoblepharys_adspersa           O          750       23.05
## Liolaemus_abaucan                 O         2600       20.20
## Liolaemus_albiceps                V         4020       12.38
## Liolaemus_andinus                 V         4900       11.40
## Liolaemus_annectens               V         4688        5.10
## Liolaemus_anomalus                O         1400       23.78
name.check(Liol.tree,Liol.data)
## [1] "OK"

Let's plot our tree and data. Here, instead of using a 'canned' plotting routine, I'm going to just plot my tree as normal, and then I'll add different colored horizontal bars with the base R function polygon at the vertical position of each tip. This will help us to see the distribution of our trait across the tips of the tree more easily.

h<-max(nodeHeights(Liol.tree))
plotTree(Liol.tree,ftype="off",lwd=1,
    xlim=c(0,1.05*h))
parity_mode<-setNames(Liol.data$parity_mode,
    rownames(Liol.data))
levels(parity_mode)<-c("oviparous","viviparous")
pp<-get("last_plot.phylo",envir=.PlotPhyloEnv)
cols<-setNames(c("#f0ead6",palette()[2]),
    levels(parity_mode))[parity_mode[Liol.tree$tip.label]]
for(i in 1:Ntip(Liol.tree))
    polygon(c(h,1.05*h,1.05*h,h),
        rep(pp$yy[i],4)+c(-0.5,-0.5,0.5,0.5),
        border=FALSE,col=cols[i])
legend("topleft",levels(parity_mode),pch=15,
    col=c("#f0ead6",palette()[2]),pt.cex=1.5,
    cex=0.8,bty="n")

plot of chunk unnamed-chunk-2

Neat. Now let's start by fitting a simple Mk model to our data - first using fitMk in phytools and then using fitContinuous in geiger.

Mk.fit<-fitMk(Liol.tree,parity_mode,model="ARD",
    pi="fitzjohn")
Mk.fit
## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##            oviparous viviparous
## oviparous  -0.033819   0.033819
## viviparous  0.000000   0.000000
## 
## Fitted (or set) value of pi:
##  oviparous viviparous 
##          1          0 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -65.946269 
## 
## Optimization method used was "nlminb"
fD.fit<-fitDiscrete(Liol.tree,parity_mode,model="ARD")
fD.fit
## GEIGER-fitted comparative model of discrete data
##  fitted Q matrix:
##                    oviparous    viviparous
##     oviparous  -3.381880e-02  3.381880e-02
##     viviparous  3.023729e-17 -3.023729e-17
## 
##  model summary:
##  log-likelihood = -65.946271
##  AIC = 135.892541
##  AICc = 135.939600
##  free parameters = 2
## 
## Convergence diagnostics:
##  optimization iterations = 100
##  failed iterations = 0
##  number of iterations with same best fit = 21
##  frequency of best fit = 0.21
## 
##  object summary:
##  'lik' -- likelihood function
##  'bnd' -- bounds for likelihood search
##  'res' -- optimization iteration summary
##  'opt' -- maximum likelihood parameter estimates

Next, let's pull our Q matrices from each fitted model object to compare them:

as.Qmatrix(Mk.fit)
## Estimated Q matrix:
##             oviparous viviparous
## oviparous  -0.0338188  0.0338188
## viviparous  0.0000000  0.0000000
as.Qmatrix(fD.fit)
## Estimated Q matrix:
##                oviparous    viviparous
## oviparous  -3.381880e-02  3.381880e-02
## viviparous  3.023729e-17 -3.023729e-17

These are basically the same. (fitDiscrete uses a bounded optimization that is effectively bounded above zero. That's why it ends up with paramter estimates that are very slightly above zero.) Neat.

A useful function of this is that we can also then easily pass the fitted value of Q from one method to the likelihood function of the other.

Let's see.

Qhat<-as.Qmatrix(fD.fit)
Mk.fit$lik ## this is our likelihood function
## function (q) 
## -lik(q, output.liks = FALSE, pi = if (root.prior == "nuisance") "fitzjohn" else pi)
## <bytecode: 0x000000001b7da088>
## <environment: 0x0000000018945390>
Mk.fit$lik(Qhat)
## [1] -65.94627

Now let's try our hidden-rates model.

Remember, this is a model in which some or all of our observed trait character states have one or more unobserved states. Rates of transition between these different uobserved states are allowed to assume different values. As such, this is a model that permits heterogeneity in the evolution rate for a discrete character in different parts of the tree. This makes sense for a character like parity mode - so why don't we try to fit it to our data.

hrm.fit<-fitHRM(Liol.tree,parity_mode,pi="fitzjohn",
    niter=20)
## 
## This is the design matrix of the fitted model.
## Does it make sense?
## 
##             oviparous oviparous* viviparous viviparous*
## oviparous           0          1          2           0
## oviparous*          3          0          0           4
## viviparous          5          0          0           6
## viviparous*         0          7          8           0
## 
## log-likelihood from current iteration: -61.0197711223499 
##  --- Best log-likelihood so far: -61.0197711223499 ---
## log-likelihood from current iteration: -60.6924346307293 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -61.0733471806403 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -62.4486138909471 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -65.9463331744238 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -62.447949469765 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -63.9024247439437 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -60.6924346307343 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -62.4482458361831 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -65.9439241118116 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -65.9462741239522 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -85.7026994881037 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -65.946269375495 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -60.714873606756 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -61.019765219294 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -62.4479322276706 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -63.9785288298267 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -62.4479768466855 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -61.0197570736084 
##  --- Best log-likelihood so far: -60.6924346307293 ---
## log-likelihood from current iteration: -62.4479569470663 
##  --- Best log-likelihood so far: -60.6924346307293 ---
hrm.fit
## Object of class "fitHRM".
## 
## Observed states: [ oviparous, viviparous ]
## Number of rate categories per state: [ 2, 2 ]
## 
## Fitted (or set) value of Q:
##             oviparous oviparous* viviparous viviparous*
## oviparous   -1.971161   0.039879   1.931282    0.000000
## oviparous*   0.037896  -0.037896   0.000000    0.000000
## viviparous   2.697096   0.000000  -3.111551    0.414455
## viviparous*  0.000000   0.000000   0.000000    0.000000
## 
## Fitted (or set) value of pi:
##   oviparous  oviparous*  viviparous viviparous* 
##    0.034061    0.939411    0.026527    0.000000 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -60.692435 
## 
## Optimization method used was "nlminb"
as.Qmatrix(hrm.fit)
## Estimated Q matrix:
##               oviparous  oviparous* viviparous viviparous*
## oviparous   -1.97116076  0.03987888   1.931282   0.0000000
## oviparous*   0.03789636 -0.03789636   0.000000   0.0000000
## viviparous   2.69709588  0.00000000  -3.111551   0.4144553
## viviparous*  0.00000000  0.00000000   0.000000   0.0000000

We can likewise fit the same model using corHMM in the corHMM package by Beaulieu et al. (2020). We'll do that as follows:

library(corHMM)
corhmm.data<-data.frame(Genus_species=names(parity_mode),
    parity_mode=as.numeric(parity_mode))
head(corhmm.data)
##             Genus_species parity_mode
## 1 Ctenoblepharys_adspersa           1
## 2       Liolaemus_abaucan           1
## 3      Liolaemus_albiceps           2
## 4       Liolaemus_andinus           2
## 5     Liolaemus_annectens           2
## 6      Liolaemus_anomalus           1
corhmm.fit<-corHMM(Liol.tree,corhmm.data,rate.cat=2,nstarts=20,
    root.p="maddfitz")
## State distribution in data:
## States:  1   2   
## Counts:  79  179 
## Beginning thorough optimization search -- performing 20 random restarts 
## Finished. Inferring ancestral states using marginal reconstruction.
corhmm.fit
## 
## Fit
##       -lnL     AIC     AICc Rate.cat ntax
##  -61.12349 134.247 134.5816        2  258
## 
## Rates
##            (1,R1)      (2,R1)      (1,R2)    (2,R2)
## (1,R1)         NA 0.270645361 0.032034295        NA
## (2,R1) 5.4693e+00          NA          NA 0.0320343
## (1,R2) 1.0000e-09          NA          NA 0.2677090
## (2,R2)         NA 0.000000001 0.000000001        NA
## 
## Arrived at a reliable solution

We get a slightly different solution; however, I also wrote an as.Qmatrix for the corhmm object class, so let's use it to check & see if our fitted model gives us the same likelihood in fitHRM.

We have to reorder the rows & columns of our fitted transition matrix, Q, because corHMM and fitHRM set the order of the hidden-states slightly differently.

Qhat<-as.Qmatrix(corhmm.fit)
Qhat
## Estimated Q matrix:
##              (1,R1)       (2,R1)       (1,R2)       (2,R2)
## (1,R1) -0.302679656  0.270645361  0.032034295  0.000000000
## (2,R1)  5.469299602 -5.501333897  0.000000000  0.032034295
## (1,R2)  0.000000001  0.000000000 -0.267708984  0.267708983
## (2,R2)  0.000000000  0.000000001  0.000000001 -0.000000002
Qhat[]<-Qhat[c(1,3,2,4),c(1,3,2,4)]

Now compute the likelihood:

hrm.fit$lik(Qhat)
## [1] -61.12349

We see that this matches exactly the likelihood that was computed by corHMM, strongly indicating that both functions are fitting the same model (which is good). In this case, fitHRM does a little bit better, but if we repeated our optimization, we might find that corHMM did better, or that they produced the same result.

Finally, let's fit a different hidden-rates model, but in this case one in which transitions between the hidden states are not allowed.

In fitHRM this is called the umbral=TRUE model because of its similarity to the threshold model. One way to think about the umbral model as a model in which each observed character level can be in a 'warm' or 'cold' condition. When a lineage is in the cold condition for the trait, changes are turned off completely. The only way to change is by first transitioning back to the warm condition. This model is just a generalization of the covarion model of Galtier (2001), but in which each not each trait level needs to have a hidden rate category, and in which the number of hidden rate categories can be more than two per observed state (although I've no idea if this model is actually identifiable).

hrm.umbral<-fitHRM(Liol.tree,parity_mode,umbral=TRUE,
    pi="fitzjohn")
## 
## This is the design matrix of the fitted model.
## Does it make sense?
## 
##             oviparous oviparous* viviparous viviparous*
## oviparous           0          1          2           0
## oviparous*          3          0          0           0
## viviparous          4          0          0           5
## viviparous*         0          0          6           0
## 
## log-likelihood from current iteration: -65.9469992403555 
##  --- Best log-likelihood so far: -65.9469992403555 ---
## log-likelihood from current iteration: -65.9470202798638 
##  --- Best log-likelihood so far: -65.9469992403555 ---
## log-likelihood from current iteration: -60.6924346307326 
##  --- Best log-likelihood so far: -60.6924346307326 ---
## log-likelihood from current iteration: -63.9197013881581 
##  --- Best log-likelihood so far: -60.6924346307326 ---
## log-likelihood from current iteration: -65.9470744150344 
##  --- Best log-likelihood so far: -60.6924346307326 ---
## log-likelihood from current iteration: -65.94289825734 
##  --- Best log-likelihood so far: -60.6924346307326 ---
## log-likelihood from current iteration: -60.6924346307313 
##  --- Best log-likelihood so far: -60.6924346307313 ---
## log-likelihood from current iteration: -65.9462856362377 
##  --- Best log-likelihood so far: -60.6924346307313 ---
## log-likelihood from current iteration: -63.9121710758381 
##  --- Best log-likelihood so far: -60.6924346307313 ---
## log-likelihood from current iteration: -66.1082211100812 
##  --- Best log-likelihood so far: -60.6924346307313 ---
hrm.umbral
## Object of class "fitHRM".
## 
## Observed states: [ oviparous, viviparous ]
## Number of rate categories per state: [ 2, 2 ]
## 
## Fitted (or set) value of Q:
##             oviparous oviparous* viviparous viviparous*
## oviparous   -1.971170   0.039879   1.931291    0.000000
## oviparous*   0.037896  -0.037896   0.000000    0.000000
## viviparous   2.697106   0.000000  -3.111561    0.414455
## viviparous*  0.000000   0.000000   0.000000    0.000000
## 
## Fitted (or set) value of pi:
##   oviparous  oviparous*  viviparous viviparous* 
##    0.034062    0.939411    0.026528    0.000000 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -60.692435 
## 
## Optimization method used was "nlminb"

We can graph each of these three models as follows.

layout(matrix(c(1,1,2,2,4,3,3,4),2,4,byrow=TRUE))
plot(Mk.fit)
mtext("a) ARD model",adj=0.1,line=0,cex=0.8)
plot(hrm.umbral,spacer=0.3)
mtext("b) \"umbral\" hidden-rates model",adj=0.1,
    line=0,cex=0.8)
plot(hrm.fit,spacer=0.15)
mtext("c) hidden-rates model",adj=0.1,
    line=0,cex=0.8)

plot of chunk unnamed-chunk-11

Finally, let's compare our three models using AIC.

AIC(Mk.fit,hrm.umbral,hrm.fit)
##            df      AIC
## Mk.fit      2 135.8925
## hrm.umbral  6 133.3849
## hrm.fit     8 137.3849

This tells us that our “umbral” HRM is the best fit to our data - although not by much.

3 comments:

  1. Hi there - I was wondering if you could share your version of 'Liolaemidae.data.csv'- the version from the paper is in word format and converting it to reproduce your work here has proved awkward

    ReplyDelete
    Replies
    1. Hi Unknown. I had the same problem. I believe I was able to convert the PDF to a Word document, and then I copied & pasted the table into Excel before saving as CSV. If you email me I will send you the data file. - Liam

      Delete

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.