Friday, December 8, 2023

Reconstructing ancestral states of a discrete character under Γ rate heterogeneity among branches: an empirical case study

Over the past couple of days I’ve shown how phytools could now be used to fit a discrete character evolution model with \(\Gamma\)-distributed rate heterogeneity among edges (yesterday morning); and how this fitted model can be applied to reconstruct ancestral states at internal nodes (this morning).

Here, I’m going to take use an empirical case study to illustrate how it matters.

For the case study, I’m going to use data & a phylogeny subsampled from an excellent dataset mapping bi- and tri-cellular pollen grains across a tree of flowering plants (Williams et al. 2014). (We also used these data in our book with Princeton University Press, so you can download the files from the book website.)

I’ll start off by loading phytools after updating it from GitHub as follows.

library(phytools)
## Loading required package: ape
## Loading required package: maps
packageVersion("phytools")
## [1] '2.0.9'

Now, I’ll load my dataset & tree from file.

pollen_data<-read.csv(
  file="http://www.phytools.org/Rbook/7/pollen-data.csv",
  row.names=1)
head(pollen_data)
##                        V1
## Nuphar_lutea            3
## Nuphar_pumila           2
## Nuphar_advena           2
## Austrobaileya_scandens  2
## Schisandra_propinqua    2
## Kadsura_heteroclita     2
pollen<-setNames(as.factor(pollen_data$V1),rownames(pollen_data))
pollen_tree<-read.tree(
  file="http://www.phytools.org/Rbook/7/pollen-tree.phy")
pollen_tree
## 
## Phylogenetic tree with 511 tips and 510 internal nodes.
## 
## Tip labels:
##   Nuphar_lutea, Nuphar_pumila, Nuphar_advena, Austrobaileya_scandens, Schisandra_propinqua, Kadsura_heteroclita, ...
## 
## Rooted; includes branch lengths.

Our trait has two levels, so we can start by fitting a standard extended Mk model with different backward & forward transition rates using fitMk as follows.

pollen_mk<-fitMk(pollen_tree,pollen,model="ARD",
  pi="fitzjohn",rand_start=TRUE)
pollen_mk
## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##           2         3
## 2 -0.002997  0.002997
## 3  0.004727 -0.004727
## 
## Fitted (or set) value of pi:
##        2        3 
## 0.860551 0.139449 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -215.278634 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

Now let’s do the same for our new \(\Gamma\) rate heterogeneous model using the new phytools function fitgammaMk

pollen_gamma<-fitgammaMk(pollen_tree,pollen,
  model="ARD",pi="fitzjohn",
  rand_start=TRUE,min.alpha=0.01)
pollen_gamma
## Object of class "fitgammaMk".
## 
## Fitted (or set) value of Q:
##           2         3
## 2 -0.007473  0.007473
## 3  0.009101 -0.009101
## 
## Fitted (or set) value of alpha rate heterogeneity
## (with 4 rate categories): 0.015511
## 
## Fitted (or set) value of pi:
##       2       3 
## 0.88223 0.11777 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -207.155259 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

Right away, it’s easy for me to ascertain that our \(\Gamma\) model is better-supported than our standard Mk model – but let’s run an anova function call on the two objects to check.

anova(pollen_mk,pollen_gamma)
##                 log(L) d.f.      AIC       weight
## pollen_mk    -215.2786    2 434.5573 0.0008053927
## pollen_gamma -207.1553    3 420.3105 0.9991946073

This indicates that almost all of the weight of evidence falls on our \(\Gamma\) variable-rate model.

As I showed this morning, it’s already straightforward to obtain ancestral states under our fitted model.

Here, I’m going to do it twice: once for our Mk model and then a second time for our \(\Gamma\) model. When I create my graph showing these marginal estimates mapped onto the tree, I’ll increase the node size of any node where neither state has a marginal scaled likelihood > 0.95.

pollen_mk.anc<-ancr(pollen_mk)
cex<-apply(pollen_mk.anc$ace,1,
  function(x) if(any(x>0.95)) 0.1 else 0.3)
cols<-setNames(viridisLite::viridis(n=2),2:3)
plot(pollen_mk.anc,type="arc",arc_height=0.2,
  args.nodelabels=list(cex=cex,piecol=cols))
mtext("a) standard Mk model",adj=0.05,line=-1)

plot of chunk unnamed-chunk-8

pollen_gamma.anc<-ancr(pollen_gamma)
cex<-apply(pollen_gamma.anc$ace,1,
  function(x) if(any(x>0.95)) 0.1 else 0.3)
cols<-setNames(viridisLite::viridis(n=2),2:3)
plot(pollen_gamma.anc,type="arc",arc_height=0.2,
  args.nodelabels=list(cex=cex,piecol=cols))
mtext(expression(paste("b) ",Gamma,
  " rate variable model")),
  adj=0.05,line=-1)

plot of chunk unnamed-chunk-9

Though there are some broad similarities between these two different reconstructions, there are also some differences.

Which reconstruction is better is, of course, impossible to know; however, the important point here is that they’re not the same.

par(mar=c(5.1,4.1,1.1,1.1))
plot(pollen_mk.anc$ace,pollen_gamma.anc$ace,bty="n",
  cex.axis=0.7,las=1,xlab="Mk estimates",
  ylab=expression(paste(Gamma," estimates")),pch=16,
  col=make.transparent("red",0.5))
grid()
lines(c(0,1),c(0,1),lwd=7,
  col=make.transparent("blue",0.25))
legend("topleft","1:1 line",lwd=7,
  col=make.transparent("blue",0.25),bty="n")

plot of chunk unnamed-chunk-10

Just for fun, let’s also throw in the hidden-rates model – which was the model we fit to these very same data in our book.

pollen_hrm<-fitHRM(pollen_tree,pollen,ncat=3,
  corHMM_model=TRUE,pi="fitzjohn",parallel=TRUE,
  niter=20)
## 
## This is the design matrix of the fitted model.
## Does it make sense?
## 
##     2 2* 2** 3 3* 3**
## 2   0  1   0 2  0   0
## 2*  3  0   4 0  5   0
## 2** 0  6   0 0  0   7
## 3   8  0   0 0  1   0
## 3*  0  9   0 3  0   4
## 3** 0  0  10 0  6   0
## 
## Opened cluster with 15 cores.
## Running optimization iterations in parallel.
## Please wait....
pollen_hrm
## Object of class "fitHRM".
## 
## Observed states: [ 2, 3 ]
## Number of rate categories per state: [ 3, 3 ]
## 
## Fitted (or set) value of Q:
##              2        2*          2**           3        3*         3**
## 2    -0.002860  0.002857     0.000000    0.000003  0.000000    0.000000
## 2*    0.001948 -0.053113     0.003072    0.000000  0.048092    0.000000
## 2**   0.000000  0.001802 -1631.512168    0.000000  0.000000 1631.510366
## 3   157.662581  0.000000     0.000000 -157.665438  0.002857    0.000000
## 3*    0.000000  0.541302     0.000000    0.001948 -0.546323    0.003072
## 3**   0.000000  0.000000    42.633595    0.000000  0.001802  -42.635397
## 
## Fitted (or set) value of pi:
##        2       2*      2**        3       3*      3** 
## 0.445173 0.053725 0.001102 0.445173 0.053725 0.001102 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -201.315836 
## 
## Optimization method used was "optim"
## 
## R thinks it has found the ML solution.
pollen_hrm.anc<-ancr(pollen_hrm)
pollen_hrm.anc
## Marginal ancestral state estimates:
##            2       2*      2**        3       3*      3**
## 512 0.492819 0.007178 0.000003 0.492819 0.007178 0.000003
## 513 0.260684 0.644423 0.000960 0.000000 0.057213 0.036721
## 514 0.283152 0.645633 0.000279 0.000001 0.060247 0.010688
## 515 0.991548 0.007771 0.000000 0.000000 0.000677 0.000003
## 516 0.944497 0.050783 0.000010 0.000000 0.004327 0.000383
## 517 0.736656 0.241697 0.000005 0.000000 0.021449 0.000193
## ...
## 
## Log-likelihood = -201.315836

Let’s compare this model to our two previous ones.

anova(pollen_mk,pollen_gamma,pollen_hrm)
##                 log(L) d.f.      AIC       weight
## pollen_mk    -215.2786    2 434.5573 0.0006133741
## pollen_gamma -207.1553    3 420.3105 0.7609704687
## pollen_hrm   -201.3158   10 422.6317 0.2384161572

Interestingly, this would suggest that this model is not as well-supported as our \(\Gamma\) rate-heterogeneous model. (Note that Williams et al. 2014 et used a much larger tree & dataset. With that data, we might obtain a different result.)

Cool. That’s all for now, folks!

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.