Thursday, June 1, 2023

Decommissioning rerootingMethod (and a comparison of methods for marginal ancestral state reconstruction in R)

phytools presently has two different functions that do marginal ancestral state reconstruction for discrete characters: ancr and rerootingMethod. The latter of these two implements a re-rooting algorithm described (so far as I can tell) in Yang et al. (1995). Since the way this algorithm works is by re-rooting the tree at each internal node and re-computing the partial (scaled) likelihoods at the (new, temporary) root each time, it obviously will not work for asymmetrical transition rates between states!

As ancr and rerootingMethod are essentially redundant, I’m “decommissioning” rerootingMethod. It will remain in the package; however, every use will trigger a “note” for the user suggesting ancr.

As part of this decommissioning, I thought it would be worth showing that the function returns the same marginal ancestral states as ancr (and other functions), so long as its assumptions are met.

library(phytools)
library(corHMM)

In this demo, I’ll use a dataset for feeding mode in sunfishes (Centrarchidae) from Revell & Collar (2009).

data(sunfish.tree)
sunfish.tree
## 
## Phylogenetic tree with 28 tips and 27 internal nodes.
## 
## Tip labels:
## 	Acantharchus_pomotis, Lepomis_gibbosus, Lepomis_microlophus, Lepomis_punctatus, Lepomis_miniatus, Lepomis_auritus, ...
## 
## The tree includes a mapped, 2-state discrete character
## with states:
## 	non, pisc
## 
## Rooted; includes branch lengths.

Our tree already has a mapped discrete character encoded (and is, thusly, an object of class "simmap" as well as "phylo"). Let’s strip that away using as.phylo as follows.

sunfish.tree<-as.phylo(sunfish.tree)
sunfish.tree
## 
## Phylogenetic tree with 28 tips and 27 internal nodes.
## 
## Tip labels:
##   Acantharchus_pomotis, Lepomis_gibbosus, Lepomis_microlophus, Lepomis_punctatus, Lepomis_miniatus, Lepomis_auritus, ...
## 
## Rooted; includes branch lengths.

Now our data.

data(sunfish.data)
head(sunfish.data)
##                      feeding.mode gape.width buccal.length
## Acantharchus_pomotis         pisc      0.114        -0.009
## Lepomis_gibbosus              non     -0.133        -0.009
## Lepomis_microlophus           non     -0.151         0.012
## Lepomis_punctatus             non     -0.103        -0.019
## Lepomis_miniatus              non     -0.134         0.001
## Lepomis_auritus               non     -0.222        -0.039

The discrete character from this dataset that we’ll analyze is feeding.mode, so let’s pull it into a vector.

feeding_mode<-setNames(sunfish.data$feeding.mode,rownames(sunfish.data))

Now we’re ready to compute marginal ancestral states. We can start with an "ER" (equal-rates) model. I’m going to use three different functions: the aforementioned ancr and rerootingMethod, as well as corHMM from the corHMM package, and ace in ape.

## ancr
sunfish.er_model<-fitMk(sunfish.tree,feeding_mode,model="ER")
sunfish.er_ancr<-ancr(sunfish.er_model)
sunfish.er_ancr
## Marginal ancestral state estimates:
##         non     pisc
## 29 0.310020 0.689980
## 30 0.311699 0.688301
## 31 0.358258 0.641742
## 32 0.827481 0.172519
## 33 0.987669 0.012331
## 34 0.995779 0.004221
## ...
## 
## Log-likelihood = -13.07453
## ape::ace
sunfish.er_ace<-ace(feeding_mode,sunfish.tree,type="discrete",model="ER")
sunfish.er_ace
## 
##     Ancestral Character Estimation
## 
## Call: ace(x = feeding_mode, phy = sunfish.tree, type = "discrete", 
##     model = "ER")
## 
##     Log-likelihood: -12.38138 
## 
## Rate index matrix:
##      non pisc
## non    .    1
## pisc   1    .
## 
## Parameter estimates:
##  rate index estimate std-err
##           1   4.2289  1.6056
## 
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
##     non    pisc 
## 0.31002 0.68998
## corHMM::corHMM
Data<-data.frame(Genus_species=names(feeding_mode),
  x=feeding_mode)
sunfish.er_corhmm<-corHMM(sunfish.tree,data=Data,rate.cat=1,model="ER",
  root.p=c(0.5,0.5))
## State distribution in data:
## States:	1	2	
## Counts:	12	16	
## Beginning thorough optimization search -- performing 0 random restarts 
## Finished. Inferring ancestral states using marginal reconstruction.
sunfish.er_corhmm
## 
## Fit
##       -lnL      AIC     AICc Rate.cat ntax
##  -13.07453 28.14906 28.30291        1   28
## 
## Legend
##      1      2 
##  "non" "pisc" 
## 
## Rates
##          (1,R1)   (2,R1)
## (1,R1)       NA 4.228909
## (2,R1) 4.228909       NA
## 
## Arrived at a reliable solution
## rerootingMethod
sunfish.er_yang<-rerootingMethod(sunfish.tree,feeding_mode,model="ER")
## 
## Note:
##    This function is redundant with 'phytools::ancr' in situations in
##    which it should be used (symmetric Q matrices) & invalid for non-
##    symmetric Q matrices (e.g., model='ARD').
print(sunfish.er_yang,printlen=10)
## Ancestral character estimates using re-rooting method
## of Yang et al. (1995):
##         non     pisc
## 29 0.310020 0.689980
## 30 0.311699 0.688301
## 31 0.358258 0.641742
## 32 0.827481 0.172519
## 33 0.987669 0.012331
## 34 0.995779 0.004221
## 35 0.999566 0.000434
## 36 0.999931 0.000069
## 37 0.999266 0.000734
## 38 0.999809 0.000191
## ...
## 
## Estimated transition matrix,
## Q =
##            non      pisc
## non  -4.228856  4.228856
## pisc  4.228856 -4.228856
## 
## **Note that if Q is not symmetric the marginal
## reconstructions may be invalid.
## 
## Log-likelihood = -13.07453

Superficiallly, these estimates look similar. Let’s compare them in all combinations.

layout(matrix(c(1,2,3,7,4,5,7,7,6),3,3,byrow=FALSE))
plot(sunfish.er_ancr$ace,sunfish.er_ace$lik.anc,
  xlab="phytools::ancr",ylab="ape::ace",
  bty="n",pch=16,col=make.transparent("blue",0.5))
grid()
mtext("a) phytools::ancr vs. ace (model=\"ER\")",line=2,
  cex=0.8)
plot(sunfish.er_ancr$ace,sunfish.er_corhmm$states,
  xlab="phytools::ancr",ylab="corHMM",
  bty="n",pch=16,col=make.transparent("blue",0.5))
grid()
mtext("b) phytools::ancr vs. corHMM (model=\"ER\")",line=2,
  cex=0.8)
plot(sunfish.er_ancr$ace,sunfish.er_yang$marginal.anc,
  xlab="phytools::ancr",ylab="phytools::rerootingMethod",
  bty="n",pch=16,col=make.transparent("blue",0.5))
grid()
mtext("c) phytools::ancr vs.rerootingMethod",
  line=2,cex=0.8)
plot(sunfish.er_ace$lik.anc,sunfish.er_corhmm$states,
  xlab="ape::ace",ylab="corHMM",
  bty="n",pch=16,col=make.transparent("blue",0.5))
grid()
mtext("d) ape::ace vs. corHMM (model=\"ER\")",line=2,
  cex=0.8)
plot(sunfish.er_ace$lik.anc,sunfish.er_yang$marginal.anc,
  xlab="ape::ace",ylab="phytools::rerootingMethod",
  bty="n",pch=16,col=make.transparent("blue",0.5))
grid()
mtext("e) ape::ace vs. rerootingMethod",line=2,
  cex=0.8)
plot(sunfish.er_corhmm$states,sunfish.er_yang$marginal.anc,
  xlab="corHMM",ylab="phytools::rerootingMethod",
  bty="n",pch=16,col=make.transparent("blue",0.5))
grid()
mtext("f) corHMM vs. phytools::rerootingMethod",line=2,
  cex=0.8)

plot of chunk unnamed-chunk-10

This shows us that all four approaches give us the same marginal estimates. That’s good.

Now let’s compare this to an asymmetric model: "ARD". Note that rerootingMethod should not be used for this type of model – but the function will still run & give a result. ancr and corHMM::corHMM are expected to behave fine. Previously I had identified problems with ape::ace for non-symmetric Q matrices.

## ancr
sunfish.ard_model<-fitMk(sunfish.tree,feeding_mode,model="ARD")
sunfish.ard_ancr<-ancr(sunfish.ard_model)
sunfish.ard_ancr
## Marginal ancestral state estimates:
##         non     pisc
## 29 0.554292 0.445708
## 30 0.568076 0.431924
## 31 0.589648 0.410352
## 32 0.931603 0.068397
## 33 0.995898 0.004102
## 34 0.998398 0.001602
## ...
## 
## Log-likelihood = -12.864937
sunfish.ard_ace<-ace(feeding_mode,sunfish.tree,type="discrete",model="ARD")
sunfish.ard_ace
## 
##     Ancestral Character Estimation
## 
## Call: ace(x = feeding_mode, phy = sunfish.tree, type = "discrete", 
##     model = "ARD")
## 
##     Log-likelihood: -12.17179 
## 
## Rate index matrix:
##      non pisc
## non    .    2
## pisc   1    .
## 
## Parameter estimates:
##  rate index estimate std-err
##           1   3.0489  2.1746
##           2   6.0878  2.5876
## 
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
##       non      pisc 
## 0.5542917 0.4457083
Data<-data.frame(Genus_species=names(feeding_mode),
  x=feeding_mode)
sunfish.ard_corhmm<-corHMM(sunfish.tree,data=Data,rate.cat=1,model="ARD",
  root.p=c(0.5,0.5))
## State distribution in data:
## States:	1	2	
## Counts:	12	16	
## Beginning thorough optimization search -- performing 0 random restarts 
## Finished. Inferring ancestral states using marginal reconstruction.
sunfish.ard_corhmm
## 
## Fit
##       -lnL      AIC     AICc Rate.cat ntax
##  -12.86494 29.72987 30.20987        1   28
## 
## Legend
##      1      2 
##  "non" "pisc" 
## 
## Rates
##          (1,R1)  (2,R1)
## (1,R1)       NA 6.08716
## (2,R1) 3.049321      NA
## 
## Arrived at a reliable solution
sunfish.ard_yang<-rerootingMethod(sunfish.tree,feeding_mode,model="ARD")
## 
## Note:
##    This function is redundant with 'phytools::ancr' in situations in
##    which it should be used (symmetric Q matrices) & invalid for non-
##    symmetric Q matrices (e.g., model='ARD').
print(sunfish.ard_yang,printlen=10)
## Ancestral character estimates using re-rooting method
## of Yang et al. (1995):
##         non     pisc
## 29 0.554292 0.445708
## 30 0.601789 0.398211
## 31 0.645854 0.354146
## 32 0.955758 0.044242
## 33 0.997528 0.002472
## 34 0.999098 0.000902
## 35 0.999857 0.000143
## 36 0.999975 0.000025
## 37 0.999785 0.000215
## 38 0.999937 0.000063
## ...
## 
## Estimated transition matrix,
## Q =
##            non      pisc
## non  -3.048905  6.087789
## pisc  3.048905 -6.087789
## 
## **Note that if Q is not symmetric the marginal
## reconstructions may be invalid.
## 
## Log-likelihood = -12.864937
layout(matrix(c(1,2,3,7,4,5,7,7,6),3,3,byrow=FALSE))
plot(sunfish.ard_ancr$ace,sunfish.ard_ace$lik.anc,
  xlab="phytools::ancr",ylab="ape::ace",
  bty="n",pch=16,col=make.transparent("blue",0.5))
grid()
mtext("a) phytools::ancr vs. ace (model=\"ARD\")",line=2,
  cex=0.8)
plot(sunfish.ard_ancr$ace,sunfish.ard_corhmm$states,
  xlab="phytools::ancr",ylab="corHMM",
  bty="n",pch=16,col=make.transparent("blue",0.5))
grid()
mtext("b) phytools::ancr vs. corHMM (model=\"ARD\")",line=2,
  cex=0.8)
plot(sunfish.ard_ancr$ace,sunfish.ard_yang$marginal.anc,
  xlab="phytools::ancr",ylab="phytools::rerootingMethod",
  bty="n",pch=16,col=make.transparent("blue",0.5))
grid()
mtext("c) phytools::ancr vs.rerootingMethod",
  line=2,cex=0.8)
plot(sunfish.ard_ace$lik.anc,sunfish.ard_corhmm$states,
  xlab="ape::ace",ylab="corHMM",
  bty="n",pch=16,col=make.transparent("blue",0.5))
grid()
mtext("d) ape::ace vs. corHMM (model=\"ARD\")",line=2,
  cex=0.8)
plot(sunfish.ard_ace$lik.anc,sunfish.ard_yang$marginal.anc,
  xlab="ape::ace",ylab="phytools::rerootingMethod",
  bty="n",pch=16,col=make.transparent("blue",0.5))
grid()
mtext("e) ape::ace vs. rerootingMethod",line=2,
  cex=0.8)
plot(sunfish.ard_corhmm$states,sunfish.ard_yang$marginal.anc,
  xlab="corHMM",ylab="phytools::rerootingMethod",
  bty="n",pch=16,col=make.transparent("blue",0.5))
grid()
mtext("f) corHMM vs. phytools::rerootingMethod",line=2,
  cex=0.8)

plot of chunk unnamed-chunk-15

OK, this just tells us that ancr and corHMM::corHMM yield the same results – it cannot arbitrate whether or not ape::ace is correct! (On the other hand, we know that rerootingMethod should have the wrong result here.)

How can we evaluate this discrepancy? Well, there’s a fourth R package that we can use to reconstruct marginal ancestral states, and thats the diversitree package by Rich FitzJohn. Let’s load it and put it to work.

(I need to apply a few small hacks here. Our tree barely fails is.ultrametric, so I’ll use force.ultrametric to pass. We also need to convert our factor vector to a numeric integer vector starting with 0.)

library(diversitree)
phy<-force.ultrametric(sunfish.tree,message=FALSE)
x<-setNames(as.numeric(feeding_mode),names(feeding_mode))-1
lik<-make.mk2(phy,x)
sunfish.ard_model2<-find.mle(lik,x.init=c(4,4),root=ROOT.FLAT)
sunfish.ard_diversitree<-asr.marginal(lik,
  pars=sunfish.ard_model2$par)
par(mar=c(5.1,4.1,2.1,1.1))
plot(sunfish.ard_ancr$ace,t(sunfish.ard_diversitree),
  xlab="phytools::ancr",ylab="diversitree::asr.marginal",
  bty="n",pch=16,col=make.transparent("blue",0.5),
  cex=2,las=1,cex.axis=0.8)
grid()
lines(c(0,1),c(0,1))

plot of chunk unnamed-chunk-16

This shows us that diversitree gives the same marginal states as phytools::ancr (and corHMM), so adds a bit of credence to my assertion that these are the correct ones.

Lastly, what about computation time?

To see that, I’m going to simulate a much larger tree & dataset – let’s say with 2,000 tips.

Q<-matrix(c(-1,1,0,0.5,1,0.5,0,1,-1),3,3,dimnames=list(0:2,0:2))
x<-sim.Mk(tree<-pbtree(n=2000,scale=1),Q)

Now, I’ll estimate marginal states by wrapping my function call of fitMk with an ancr call. (We could also using a pipe, %>%, here, if we wanted.)

system.time(ace_sym1<-ancr(fitMk(tree,x,model="SYM")))
##    user  system elapsed 
##   13.06    0.25   54.64
ace_sym1
## Marginal ancestral state estimates:
##             0        1        2
## 2001 0.960122 0.038142 0.001736
## 2002 0.999858 0.000142 0.000000
## 2003 0.999990 0.000010 0.000000
## 2004 0.999989 0.000011 0.000000
## 2005 0.999882 0.000118 0.000000
## 2006 0.999707 0.000293 0.000000
## ...
## 
## Log-likelihood = -633.448457

Now let’s do the same using ape::ace.

system.time(ace_sym2<-ace(x,tree,type="discrete",model="SYM"))
##    user  system elapsed 
##    0.29    0.00    1.04
ace_sym2
## 
##     Ancestral Character Estimation
## 
## Call: ace(x = x, phy = tree, type = "discrete", model = "SYM")
## 
##     Log-likelihood: -632.3498 
## 
## Rate index matrix:
##   0 1 2
## 0 . 1 2
## 1 1 . 3
## 2 2 3 .
## 
## Parameter estimates:
##  rate index estimate std-err
##           1   0.5920  0.0402
##           2   0.0000  0.0133
##           3   1.0815  0.1520
## 
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
##           0           1           2 
## 0.960121792 0.038141957 0.001736251

We can see that ace is much faster, so (inasmuch as it gives us the right estimates) that’s good! (To be fair, we could’ve used opt.method="optimParallel" and some other options to speed up fitMk, but we’d never have caught up to ace!)

par(mar=c(5.1,4.1,2.1,1.1))
plot(ace_sym1$ace,ace_sym2$lik.anc,
  xlab="phytools::ancr",ylab="ape::ace",
  bty="n",pch=16,col=make.transparent("blue",0.25),
  cex=1.2,las=1,cex.axis=0.8)
grid()
lines(c(0,1),c(0,1))

plot of chunk unnamed-chunk-22

That’s all folks!

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.