Tuesday, June 11, 2024

Fitting a discrete-state-dependent multi-rate continuous trait evolution using the discrete approximation of Boucher & Démery (2016): two empirical examples

Over the past few days I’ve posted about, and gave an R function for, a (so far as I know) novel method to modeling discrete-state-dependent rate heterogeneity in a continuous character on a phylogeny based on the discrete approximation of Boucher & Démery (2016).

I’ve written more about Boucher & Démery (2016) in multiple earlier posts to this blog, which I would encourage readers to check out (1, 2, 3, 4, 5, 6, 7, 8).

The new function (fitmultiBM) is now in phytools on GitHub. Please use with considerable caution. I’m still working out the bugs!

The idea of this method & function is to jointly model the evolution of a discrete character (evolving via a standard Mk stochastic process) and a continuous trait (whose rate of evolution \(\sigma^2\) is, in turn, influenced by the discrete character).

The standard way to investigate this type of hypothesis in phylogenetic comparative biology has been to first map the discrete character history on the tree (using, for example, stochastic character mapping), and then fit a regime-dependent model following O’Meara et al. (2006).

The disadvantage of this two-step process is that by first mapping regimes onto the tree in a way that considers only the discrete trait, we’re ignoring any information that our continuous character might contain about the discrete character history. Averaging across a set of such maps can in turn result in a biased underestimate of the difference in rate between regimes (e.g., see Revell 2013).

The discrete approximation allows us to simultaneously model our discrete and continuous characters. I predict that this will provide both more accurate estimates of the state-dependent rate variation in our continuous character, as well as a more accurate reconstruction of the evolutionary process for the discrete trait. I hope to explore this in future posts.

Before getting to that, though, I thought it might be neat to demo the method for a couple of empirical cases. These have been selected entirely due to the convenience of the data as both involve datasets that Luke Harmon & I used for our 2022 book with Princeton University Press.

We can begin, of course, by loading phytools.

library(phytools)
packageVersion("phytools")
## [1] '2.2.12'

(To follow along, you’ll need this version of phytools or later.)

Our first example involves a phylogeny and dataset for elopomorph eels from Collar et al. (2014). I’m going to start off by reading these two files directly from the book website.

eel_tree<-read.tree(
  file="http://www.phytools.org/Rbook/8/elopomorph.tre")
eel_data<-read.csv(
  file="http://www.phytools.org/Rbook/8/elopomorph.csv",
  row.names=1,stringsAsFactors=TRUE)

Our phylogeny contains 61 tips and 60 nodes.

eel_tree
## 
## Phylogenetic tree with 61 tips and 60 internal nodes.
## 
## Tip labels:
##   Moringua_edwardsi, Kaupichthys_nuchalis, Gorgasia_taiwanensis, Heteroconger_hassi, Venefica_proboscidea, Anguilla_rostrata, ...
## 
## Rooted; includes branch lengths.

Our data set has one discrete character (feeding mode: i.e., whether each eel species is a suction feeder or a biter) and a second continuous trait, overall body size. Let’s pull these out. Please keep in mind that I have no a priori reason to imagine that feeding mode influences the rate of body size evolution in eels, but these are the data we have!

head(eel_data)
##                   feed_mode Max_TL_cm
## Albula_vulpes       suction       104
## Anguilla_anguilla   suction        50
## Anguilla_bicolor    suction       120
## Anguilla_japonica   suction       150
## Anguilla_rostrata   suction       152
## Ariosoma_anago      suction        60
feed_mode<-setNames(eel_data$feed_mode,
  rownames(eel_data))
lnTL<-setNames(log(eel_data$Max_TL_cm),
  rownames(eel_data))

Let’s graph our data.

par(mfrow=c(1,2))
FMODE<-to.matrix(feed_mode[eel_tree$tip.label],
  levels(feed_mode))
plotTree(eel_tree,mar=c(5.1,4.1,1.1,1.1),
  fsize=0.4,ftype="i")
tiplabels(pie=FMODE,piecol=hcl.colors(2),cex=0.4)
par(mar=c(5.1,4.1,1.1,1.1))
phenogram(eel_tree,lnTL,spread.cost=c(1,0),ftype="i",
  fsize=0.4,ylab="log(max length)",quiet=TRUE,las=1,
  cex.axis=0.7)
tiplabels(pie=FMODE,piecol=hcl.colors(2),cex=0.4)
legend("topleft",levels(feed_mode),pch=21,
  pt.bg=hcl.colors(2),bty="n",cex=0.8,pt.cex=1.5)

plot of chunk unnamed-chunk-7

We’re now ready to fit our discrete-state-dependent multi-rate Brownian motion evolution model using the discrete approximation of Boucher & Démery (2016).

As I do that, one small feature that I’ll demo here is a hidden option called plot_model. When we set plot_model=TRUE, the function graphs the structure of the discretized model we’re fitting. I built this mostly for the purposes of debugging, but it’s generally useful for those readers of this blog interested in how these models are structured. The way to interpret the visualization is as a very large \(k \cdot n \times k \cdot n\) in size (for \(k\) levels of our discrete trait and \(n\) bins for our continuous character) model design matrix, in which each different color is a different estimated parameter in the model – either a different rate of evolution for our continuous character (on the sub- and superdiagonals) or a rate of transition between discrete states (elsewhere).

eel_sdmodel<-fitmultiBM(eel_tree,lnTL,feed_mode,
  model="ARD",parallel=TRUE,levs=100,
  plot_model=TRUE)

plot of chunk unnamed-chunk-9

Let’s print our results.

eel_sdmodel
## Object of class "fitmultiBM" based on
##     a discretization with k = 100 levels.
## 
## Fitted multi-rate BM model parameters:
##  levels: [ bite, suction ]
##   sigsq: [ 0.0098, 0.0214 ]
##      x0: 4.1992 
## 
## Estimated Q matrix:
##                 bite      suction
## bite    -0.013215826  0.013215826
## suction  0.008103207 -0.008103207
## 
## Log-likelihood: -100.7688 
## 
## R thinks it has found the ML solution.

We can see from the fitted model that the ML estimated \(\sigma^2\) rate of log(maximum body size) evolution is about twice as high in suction feeding compared to biting eels. What’s more difficult to say, however, is whether this model has significantly better explanatory power than one in which there was no difference in the rate of evolution of size between our two groups.

We can (now) fit this model in fitmultiBM by setting the hidden option null_model=TRUE. Let’s try it.

eel_nullmodel<-fitmultiBM(eel_tree,lnTL,feed_mode,
  model="ARD",parallel=TRUE,levs=100,
  null_model=TRUE)
eel_nullmodel
## Object of class "fitmultiBM" based on
##     a discretization with k = 100 levels.
## 
## Fitted multi-rate BM model parameters:
##  levels: [ bite, suction ]
##   sigsq: [ 0.0142, 0.0142 ]
##      x0: 4.1992 
## 
## Estimated Q matrix:
##                bite     suction
## bite    -0.01488082  0.01488082
## suction  0.01357827 -0.01357827
## 
## Log-likelihood: -102.2585 
## 
## R thinks it has found the ML solution.

Note, simply as an aside, that both our parameter estimates and log-likelihood under the null model should very closely match those estimated independent (e.g., using geiger::fitContinuous and fitMk), as follows.

geiger::fitContinuous(eel_tree,lnTL)$opt$sigsq
## [1] 0.01444782
phytools::fitMk(eel_tree,feed_mode,
  model="ARD",pi="mle")$rates
## [1] 0.01357832 0.01488083

(One issue I haven’t worked out is apparent above. To get the correct log-likelihood for the continuous data I have to set the root to its MLE, but for the moment I’m doing that for both the discrete and continuous trait at once, which I’m not sure is desirable. That’s why I needed to set pi="mle" in fitMk to get the the “correct” parameter estimates. I should be able to figure out how to fix this!)

logLik(geiger::fitContinuous(eel_tree,lnTL))+
  logLik(fitMk(eel_tree,feed_mode,model="ARD",
    pi="mle"))
## [1] -102.2191
## attr(,"df")
## [1] 2

To compare the two models, let’s use a generic anova call as follows.

anova(eel_nullmodel,eel_sdmodel)
##                  log(L) d.f.      AIC    weight
## eel_nullmodel -102.2585    4 212.5170 0.3799782
## eel_sdmodel   -100.7688    5 211.5377 0.6200218

Since these two models are nested, we could also undertake a likelihood-ratio test. Let’s do that using the CRAN package lmtest.

lmtest::lrtest(eel_nullmodel,eel_sdmodel)
## Likelihood ratio test
## 
## Model 1: eel_nullmodel
## Model 2: eel_sdmodel
##   #Df  LogLik Df  Chisq Pr(>Chisq)  
## 1   4 -102.26                       
## 2   5 -100.77  1 2.9793    0.08434 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Even though AIC slightly (though non-significantly, according to our likelihood ratio test) favors the multi-rate model, this is pretty weak evidence to support feeding mode dependent evolution of body size in eels!

There are a couple of other models that we might like to consider.

One possibility is that the rate of evolution of body size is heterogeneous, so our null model is indeed wrong – but this rate heterogeneity has nothing at all to do with feeding mode.

To examine this possibility we might fit a hidden-rate model in which our discrete character evolves but only the hidden character influences body size evolution. I’m going to do that by keeping null_model=TRUE but updating ncat=2. This is a “null model” in the sense that, as we’ll see in a sec, under this model the rate of evolution of log(maximum length) varies throughout the tree, but not as a function of our discrete character: feeding mode.

eel_nullmodel.hrm<-fitmultiBM(eel_tree,lnTL,feed_mode,
  model="ARD",ncat=2,parallel=TRUE,levs=100,
  null_model=TRUE)
eel_nullmodel.hrm
## Object of class "fitmultiBM" based on
##     a discretization with k = 100 levels.
## 
## Fitted multi-rate BM model parameters:
##  levels: [ bite, bite*, suction, suction* ]
##   sigsq: [ 0.0304, 0.0026, 0.0304, 0.0026 ]
##      x0: 4.368 
## 
## Estimated Q matrix:
##                 bite       bite*     suction    suction*
## bite     -0.04059342  0.02571261  0.01488080  0.00000000
## bite*     0.01968139 -0.03456219  0.00000000  0.01488080
## suction   0.01357809  0.00000000 -0.03929071  0.02571261
## suction*  0.00000000  0.01357809  0.01968139 -0.03325948
## 
## Log-likelihood: -96.8294 
## 
## R thinks it has found the ML solution.

Lastly, we can also consider a hidden-rate model in which both feeding mode and our hidden character influence the rate of evolution of eel body size.

eel_sdmodel.hrm<-fitmultiBM(eel_tree,lnTL,feed_mode,
  model="ARD",ncat=2,parallel=TRUE,levs=100)
eel_sdmodel.hrm
## Object of class "fitmultiBM" based on
##     a discretization with k = 100 levels.
## 
## Fitted multi-rate BM model parameters:
##  levels: [ bite, bite*, suction, suction* ]
##   sigsq: [ 0.0231, 0.0011, 0.0399, 0.0044 ]
##      x0: 4.3117 
## 
## Estimated Q matrix:
##                 bite       bite*     suction    suction*
## bite     -0.03540772  0.02098248  0.01442524  0.00000000
## bite*     0.01817748 -0.03260272  0.00000000  0.01442524
## suction   0.01189854  0.00000000 -0.03288102  0.02098248
## suction*  0.00000000  0.01189854  0.01817748 -0.03007601
## 
## Log-likelihood: -95.6653 
## 
## R thinks it has found the ML solution.

As a word of warning, as we progress up through these models they become both more parameter complex and also involve larger & larger model matrices for the discrete approximation. That means to reiterate the analyses above (depending on the speed of your comput er) you may need to start it running, go have dinner, and then come back to it!

Here’s a comparison of all four models, generally as they increase in model complexity.

anova(eel_nullmodel,eel_sdmodel,eel_nullmodel.hrm,
  eel_sdmodel.hrm)
##                       log(L) d.f.      AIC     weight
## eel_nullmodel     -102.25848    4 212.5170 0.05291132
## eel_sdmodel       -100.76884    5 211.5377 0.08633699
## eel_nullmodel.hrm  -96.82939    7 207.6588 0.60046674
## eel_sdmodel.hrm    -95.66532    9 209.3306 0.26028495

This shows us that the hidden-state heterogeneous rate model without state-dependent body size evolution is the best supported, accounting for its parameter complexity. This would indicate that though there may be rate heterogeneity of body size, it is not well-explained by our hypothesized discrete character.

For our second example, I’m going to use diel activity pattern and skull length in primates with data from Kirk & Kay (2004).

These data and tree are even easy to get our hands on than for our previous example as they now come packaged with phytools!

data(primate.tree)
primate.tree
## 
## Phylogenetic tree with 90 tips and 89 internal nodes.
## 
## Tip labels:
##   Allenopithecus_nigroviridis, Cercopithecus_mitis, Cercopithecus_cephus, Cercopithecus_petaurista, Erythrocebus_patas, Chlorocebus_aethiops, ...
## 
## Rooted; includes branch lengths.
data(primate.data)
lnSkull<-setNames(log(primate.data$Skull_length),
  rownames(primate.data))
activity<-setNames(primate.data$Activity_pattern,
  rownames(primate.data))

Once again, let’s graph our data.

par(mfrow=c(1,2))
ACTIVITY<-to.matrix(activity[primate.tree$tip.label],
  levels(activity))
plotTree(primate.tree,mar=c(5.1,4.1,1.1,1.1),
  fsize=0.3,ftype="i")
cols<-setNames(hcl.colors(n=3)[c(2,3,1)],
  colnames(ACTIVITY))
tiplabels(pie=ACTIVITY,piecol=cols,cex=0.4)
par(mar=c(5.1,4.1,1.1,1.1))
phenogram(primate.tree,lnSkull,spread.cost=c(1,0),ftype="i",
  fsize=0.3,ylab="log(skull length)",quiet=TRUE,las=1,
  cex.axis=0.7)
tiplabels(pie=ACTIVITY,piecol=cols,cex=0.4)
legend("topleft",colnames(ACTIVITY),pch=21,
  pt.bg=cols,bty="n",cex=0.8,pt.cex=1.5)

plot of chunk unnamed-chunk-25

This time, we can start off by fitting our null model.

primate_nullmodel<-fitmultiBM(primate.tree,
  lnSkull,activity,model="ER",parallel=TRUE,
  null_model=TRUE)
primate_nullmodel
## Object of class "fitmultiBM" based on
##     a discretization with k = 100 levels.
## 
## Fitted multi-rate BM model parameters:
##  levels: [ Cathemeral, Diurnal, Nocturnal ]
##   sigsq: [ 0.0026, 0.0026, 0.0026 ]
##      x0: 4.1912 
## 
## Estimated Q matrix:
##              Cathemeral      Diurnal    Nocturnal
## Cathemeral -0.005494909  0.002747454  0.002747454
## Diurnal     0.002747454 -0.005494909  0.002747454
## Nocturnal   0.002747454  0.002747454 -0.005494909
## 
## Log-likelihood: -12.8537 
## 
## R thinks it has found the ML solution.

Then proceed to the state-dependent counterpart.

primate_sdmodel<-fitmultiBM(primate.tree,lnSkull,
  activity,model="ER",parallel=TRUE)
primate_sdmodel
## Object of class "fitmultiBM" based on
##     a discretization with k = 100 levels.
## 
## Fitted multi-rate BM model parameters:
##  levels: [ Cathemeral, Diurnal, Nocturnal ]
##   sigsq: [ 9e-04, 0.0027, 0.0026 ]
##      x0: 4.1912 
## 
## Estimated Q matrix:
##              Cathemeral      Diurnal    Nocturnal
## Cathemeral -0.005499031  0.002749516  0.002749516
## Diurnal     0.002749516 -0.005499031  0.002749516
## Nocturnal   0.002749516  0.002749516 -0.005499031
## 
## Log-likelihood: -12.0096 
## 
## R thinks it has found the ML solution.

For these data, I’ll skip the hidden-rates model; however, let’s use anova to compare between the two.

anova(primate_nullmodel,primate_sdmodel)
##                      log(L) d.f.      AIC    weight
## primate_nullmodel -12.85368    3 31.70735 0.7605876
## primate_sdmodel   -12.00958    5 34.01916 0.2394124

In this case, once again, we find relatively evidence to support the idea of state-dependent continuous trait evolution in skull length.

If any readers would like to send me a biological example (perhaps from a study using stochastic mapping and brownie.lite) in which we’re likely to find strong evidence of discrete state dependent continuous trait evolution, I’d love to include it here!

Thanks for reading, people.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.