Thursday, May 30, 2024

Averaging marginal ancestral states across a set of trees (with different nodes)

A colleague contacted me the other day with the following question about phytools::ancr (for ancestral state reconstruction).

”…can you suggest a post or example where you use ancr with multiple trees? I have everything working with single trees, but I want to do ASR using a post-mrBayes set of 3k trees, and then summarize the info using a consensus tree.”

This is a great question that will almost certainly be of interest to many readers of this blog.

Unfortunately, in my opinion there’s no single “correct” answer. What I can show how to do is match nodes and average marginal discrete character ancestral state estimates across trees. This is only one of the things that we might do to summarize a set of ancestral state reconstructions from a sample of phylogenies, but at least it’s a start!

So, in summary, what I’m going to do below is as follows. First, I’ll fit our set of discrete character model for each of the trees in my set. Next I’m going to compute a summary tree – in my case it will be a maximum clade credibility (MCC) tree, though I could’ve used any consensus or summary method. Then, I’m going to identify matching nodes (defined by sharing the same set of descendants – that is, defining the same bipartition) between the MCC and each of the trees in our set. Finally, I’ll go tree by tree & node by node, and for each node ask (first) is the node present in the MCC tree and (if so, second) I will include the marginal scaled likelihoods in the calculation of the average marginal likelihoods for that node.

library(phytools)

For this example, I’ll use a set of trees from the 10k primate trees website of Charlie Nunn. (We have a file from this site on our book website contain 100 trees.)

primate_trees<-read.nexus(
  file="http://www.phytools.org/Rbook/9/10kTrees_Primates.nex")
primate_trees
## 100 phylogenetic trees

For our trait data we’ll use diel activity pattern (cathemeral, diurnal, and nocturnal) from Kirk & Kay (2004).

primate_data<-read.csv(
  file="http://www.phytools.org/Rbook/3/primateEyes.csv",
  row.names=1,stringsAsFactors=TRUE)
head(primate_data)
##                                  Group Skull_length Optic_foramen_area Orbit_area Activity_pattern Activity_pattern_code
## Allenopithecus_nigroviridis Anthropoid         98.5                7.0      298.7          Diurnal                     0
## Alouatta_palliata           Anthropoid        109.8                5.3      382.3          Diurnal                     0
## Alouatta_seniculus          Anthropoid        108.0                8.0      359.4          Diurnal                     0
## Aotus_trivirgatus           Anthropoid         60.5                3.1      297.4        Nocturnal                     2
## Arctocebus_aureus            Prosimian         49.5                1.2      134.8        Nocturnal                     2
## Arctocebus_calabarensis      Prosimian         53.8                1.6      156.6        Nocturnal                     2

There are mismatches between the tree and data. We can see them by running geiger::name.check.

summary(geiger::name.check(primate_trees[[1]],
  primate_data))
## 188 taxa are present in the tree but not the data:
##     Allocebus_trichotis,
##     Alouatta_belzebul,
##     Alouatta_caraya,
##     Alouatta_guariba,
##     Alouatta_pigra,
##     Alouatta_sara,
##     ....
## 5 taxa are present in the data but not the tree:
##     Eulemur_fulvus,
##     Eulemur_macaco,
##     Gorilla_gorilla,
##     Pan_troglodytes,
##     Varecia_variegata
## 
## To see complete list of mis-matched taxa, print object.

For these data I was able to ascertain that the list of five taxa listed as “missing” in the tree (Eulemur fulvus, Eulemur macaco, etc.) are being identified as absent only because on the phylogeny they’re labeled with using subspecific epithet (e.g., Pan troglodytes troglodytes). I’m going to trim off all subspecies labels from our tip names as follows.

foo<-function(x){
  nn<-strsplit(x$tip.label,"_")
  x$tip.label<-sapply(nn,function(y) paste(y[1:2],
    collapse="_"))
  x
}
sub.primate_trees<-lapply(primate_trees,foo)
class(sub.primate_trees)<-"multiPhylo"

Let’s check again.

summary(geiger::name.check(sub.primate_trees[[1]],
  primate_data))
## 182 taxa are present in the tree but not the data:
##     Allocebus_trichotis,
##     Alouatta_belzebul,
##     Alouatta_caraya,
##     Alouatta_guariba,
##     Alouatta_pigra,
##     Alouatta_sara,
##     ....
## 
## To see complete list of mis-matched taxa, print object.

At this point we should find that there are taxa in our tree and not the data, but not the converse. Let’s prune them out as follows. (Note that we could’ve left these species in and just treated them as ambiguous for the trait. Perhaps I’ll illustrate this in another post!)

primate_trees<-keep.tip(sub.primate_trees,
  rownames(primate_data))

(ape::keep.tip is vectorized which is why I didn’t need to iterate over the set!)

Now let’s pull out the trait vector.

activity<-setNames(primate_data$Activity_pattern,
  rownames(primate_data))

Finally, we can fit our trait model to each of the trees in our input set. Since this is for illustrative purposes only I’ll just fit the “ER” model. In empirical practice, we might fit multiple models and compare them. (Since we’re fitting a simple model here this runs reasonably quickly. If this was going to be slow, we might speed it up using parallelization, e.g., with the foreach package).

er_fits<-lapply(primate_trees,fitMk,x=activity,pi="fitzjohn",
  model="ER",lik.func="pruning")

Our object here (er_fits) is a simple list of fitted Mk models. To conduct marginal ancestral state estimation on this set, let’s use lapply again as follows.

er_anc<-lapply(er_fits,ancr)

At this point, we’ve run our marginal ancestral state estimation across all 100 trees in our input set. The question is, how do we summarize these results!

As mentioned early, there is no single correct solution, but one reasonable thing to do is first compute a consensus tree (or “best” tree by some measure), then look at each node in that tree & average the marginal reconstruction across all trees in which each node from the best tree is present. This is what we’re going to do, but it naturally requires that we first identify our best or consensus tree. To get the MCC tree we can use phangorn::maxCladeCred. Let’s do it.

mcc_tree<-phangorn::maxCladeCred(primate_trees)
mcc_tree
## 
## Phylogenetic tree with 90 tips and 89 internal nodes.
## 
## Tip labels:
##   Allenopithecus_nigroviridis, Cercopithecus_cephus, Cercopithecus_mitis, Cercopithecus_petaurista, Chlorocebus_aethiops, Erythrocebus_patas, ...
## Node labels:
##   NA, 1, 1, 1, 1, 1, ...
## 
## Rooted; includes branch lengths.

Next, let’s match nodes between our set of trees and the MCC tree using phytools::matchNodes. (Note that a phytools user recently identified a bug in matchNodes due to a change or update of an internally used ape function. That means that, depending on your phytools version, you may need to update from GitHub to get this to work!)

Mn<-lapply(primate_trees,matchNodes,tr1=mcc_tree)

This object (Mn) is a list of matrices – each one consisting of a list of node numbers in mcc_tree (in column 1) and the matching node, if it exists, in each tree of the group (in column 2 of each matrix).

Finally, let’s average across our set. (To make it an average, we’ll also need to keep track of the counts – this is the number of each time a node in the MCC tree is found in the trees of our group.)

ANC<-matrix(0,nrow(Mn[[1]]),ncol(er_anc[[1]]$ace),
  dimnames=list(Mn[[1]][,1],colnames(er_anc[[1]]$ace)))
Nn<-setNames(rep(0,nrow(Mn[[1]])),Mn[[1]][,1])
for(i in 1:length(er_anc)){
  for(j in 1:nrow(Mn[[i]])){
    if(!is.na(Mn[[i]][j,2])){
      ANC[j,]<-ANC[j,]+er_anc[[i]]$ace[as.character(Mn[[i]][j,2]),]
      Nn[j]<-Nn[j]+1
    }
  }
}
for(i in 1:nrow(ANC)) ANC[i,]<-ANC[i,]/Nn[i]
head(ANC)
##      Cathemeral     Diurnal    Nocturnal
## 91 7.298402e-05 0.021434804 9.784922e-01
## 92 2.246178e-03 0.051983917 9.457699e-01
## 93 6.179040e-04 0.001230314 9.981518e-01
## 94 3.911125e-03 0.940884197 5.520468e-02
## 95 9.752281e-06 0.999941719 4.852886e-05
## 96 1.151733e-05 0.999973662 1.482086e-05

Just by chance, Nn also contains a measure of the support (in the case of a Bayesian sample, the posterior probability) of each node. Let’s graph all the ones with measured posterior probabilities < 1.0 onto the tree.

plotTree(mcc_tree,direction="upwards",ftype="i",fsize=0.6)
PP<-cbind(Nn/100,1-Nn/100)
nn<-which(PP[,1]<0.999)
par(fg="transparent")
nodelabels(node=as.numeric(names(nn)),pie=PP[nn,],
  piecol=c("black","grey"),cex=0.4)

plot of chunk unnamed-chunk-16

par(fg="black")

Finally, here’s our averaged marginal ancestral states.

plotTree(mcc_tree,direction="upwards",ftype="i",fsize=0.6,
  offset=1.5)
cols<-setNames(hcl.colors(3)[c(2,3,1)],colnames(ANC))
nodelabels(pie=ANC,piecol=cols,cex=0.4)
tiplabels(pie=to.matrix(activity[mcc_tree$tip.label],
  colnames(ANC)),piecol=cols,cex=0.2)
legend("bottomright",names(cols),pch=21,pt.bg=cols,
  cex=0.8,pt.cex=1.4)

plot of chunk unnamed-chunk-17

Just to illustrate it, we could’ve also done the same thing had our preferred summary method was a consensus tree rather than the MCC tree. Here’s what that would look like.

cons_tree<-consensus(primate_trees,p=0.95,rooted=TRUE)
Mn<-lapply(primate_trees,matchNodes,tr1=cons_tree)
ANC<-matrix(0,nrow(Mn[[1]]),ncol(er_anc[[1]]$ace),
  dimnames=list(Mn[[1]][,1],colnames(er_anc[[1]]$ace)))
Nn<-setNames(rep(0,nrow(Mn[[1]])),Mn[[1]][,1])
for(i in 1:length(er_anc)){
  for(j in 1:nrow(Mn[[i]])){
    if(!is.na(Mn[[i]][j,2])){
      ANC[j,]<-ANC[j,]+er_anc[[i]]$ace[as.character(Mn[[i]][j,2]),]
      Nn[j]<-Nn[j]+1
    }
  }
}
for(i in 1:nrow(ANC)) ANC[i,]<-ANC[i,]/Nn[i]
plotTree(compute.brlen(cons_tree,power=0.75),
  direction="upwards",ftype="i",fsize=0.6,
  offset=1)
cols<-setNames(hcl.colors(3)[c(2,3,1)],colnames(ANC))
nodelabels(pie=ANC,piecol=cols,cex=0.4)
tiplabels(pie=to.matrix(activity[cons_tree$tip.label],
  colnames(ANC)),piecol=cols,cex=0.2)
legend("bottomleft",names(cols),pch=21,pt.bg=cols,
  cex=0.8,pt.cex=1.4)

plot of chunk unnamed-chunk-19

(I used compute.brlen because our consensus tree does not have edge lengths and the default branch lengths used by plotTree squished all our recent nodes to close to the tips!)

We probably see that our results are largely congruent across the two approaches, but in the latter there are actually fewer nodes in the consensus tree than for the MCC tree, which is fully resolved.

That’s it for now.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.