Thursday, October 24, 2024

Revisiting an old phytools function for placing missing taxa into a reconstructed time-tree using quantitative traits: locate.yeti

I was pleased to recently discover that the phytools function locate.yeti for placing a recently extinct taxon (or a fossil lineage, using locate.fossil) into a reconstructed tree based on continuous characters is at least occasionally still being put to use.

This method was described in Revell et al. (2015), employing a case study in the form of the (most-likely) recently extinct lizard Anolis roosevelti.

As a reminder, the idea is that we have a reconstructed tree with N - 1 taxa from (say) molecular data, and that we’d like to rigorously add in an Nth missing taxon based on continuous trait information.

In responding to some inquiries about the function, I thought it would be fun to revisit just how effective the method can be (in theory!) at placing a missing taxon into the tree correctly. (The reason for the disclaimer is that I’m quite certain that this, as we’ll see, very good behavior of the method undoubtedly depends quite quite strongly on the adequacy of our underlying trait evolution model: correlated Brownian motion.)

To start with, let’s load the phytools package as follows.

library(phytools)

Just to make things a bit more interesting, let’s also load the package clusterGeneration which I’ll use (just as we did in Revell et al. 2015) to generate random among-trait covariances matrices.

library(clusterGeneration)

At this point, we’re ready to get going! To start, I’ll imagine the smallest amount of data we could possibly use for this method – a single trait vector.

Each simulation involves generating a random tree, generating a random data vector on the tree under Brownian motion, pruning a single tip at random, then estimating it’s position in the tree using ML based on our continuous trait data following Revell et al. (2015).

Here’s what that looks like.

  tree<-pbtree(n=13,tip.label=LETTERS[1:13])
  X<-fastBM(tree,nsim=1)
  test<-drop.tip(tree,tip<-sample(tree$tip.label,1))
  ml<-locate.yeti(test,X,quiet=TRUE)
  tmp<-cophylo(tree,ml,rotate=FALSE)
  link.lwd<-rep(1,nrow(tmp$assoc))
  link.lwd[which(tmp$assoc[,1]==tip)]<-2
  link.col<-rep("blue",nrow(tmp$assoc))
  link.col[which(tmp$assoc[,1]==tip)]<-"red"
  plot(tmp,link.lwd=link.lwd,link.col=link.col,
    link.type="curved",fsize=0.8)
  legend("topright","missing taxon",lwd=2,lty="dashed",
    col="red",bty="n",cex=0.8)

plot of chunk unnamed-chunk-4

In this example we can see that even with just one character we came kinda close to getting our randomly-chosen missing taxon (labeled B in this example) back into the right place in the tree – that is, at least we hit the right clade.

Let’s do it eight more times to see if this is typical, worse, or better than we do on average – remember, this is if we’re working with only one trait. (Normally, of course, we’d expect to be dealing with quite a few more than one trait!)

par(mfrow=c(3,3))
plot(tmp,link.lwd=link.lwd,link.col=link.col,
  link.type="curved",link.lty="solid",fsize=0.8,
  ylim=c(0,1.05))
legend("topleft","missing\ntaxon",lwd=2,lty="solid",
  col="red",bty="n",cex=0.8)
for(i in 1:8){
  tree<-pbtree(n=13,tip.label=LETTERS[1:13])
  X<-fastBM(tree,nsim=1)
  test<-drop.tip(tree,tip<-sample(tree$tip.label,1))
  ml<-locate.yeti(test,X,quiet=TRUE)
  tmp<-cophylo(tree,ml,rotate=FALSE)
  link.lwd<-rep(1,nrow(tmp$assoc))
  link.lwd[which(tmp$assoc[,1]==tip)]<-2
  link.col<-rep("blue",nrow(tmp$assoc))
  link.col[which(tmp$assoc[,1]==tip)]<-"red"
  plot(tmp,link.lwd=link.lwd,link.col=link.col,
    link.type="curved",link.lty="solid",
    fsize=0.8,ylim=c(0,1.05))
}

plot of chunk unnamed-chunk-5

I’d say that this shows us our result from our first simulation was probably fairly typical – most of our estimates miss the mark, but we’re often close and sometimes land exactly on the generating phylogenetic position of our missing tip.

OK, now let’s do the same thing – but this time with two characters instead of only one.

When moving up to two characters, I’m going to include the added nuance that the characters are evolving with a random evolutionary correlation… they’re non-independent, one from the other. Obviously, this affects our analysis but locate.yeti takes this non-independence into account.

k<-2
par(mfrow=c(3,3))
for(i in 1:9){
  tree<-pbtree(n=13,tip.label=LETTERS[1:13])
  X<-sim.corrs(tree,vcv=genPositiveDefMat(k,
    covMethod="unifcorrmat")$Sigma)
  test<-drop.tip(tree,tip<-sample(tree$tip.label,1))
  ml<-locate.yeti(test,X,quiet=TRUE)
  tmp<-cophylo(tree,ml,rotate=FALSE)
  link.lwd<-rep(1,nrow(tmp$assoc))
  link.lwd[which(tmp$assoc[,1]==tip)]<-2
  link.col<-rep("blue",nrow(tmp$assoc))
  link.col[which(tmp$assoc[,1]==tip)]<-"red"
  plot(tmp,link.lwd=link.lwd,link.col=link.col,
    link.type="curved",fsize=0.8,ylim=c(0,1.05),
    link.lty="solid")
  if(i==1) legend("topleft","missing\ntaxon",
    lwd=2,lty="solid",col="red",bty="n",cex=0.8)
}

plot of chunk unnamed-chunk-6

This should reveal that now we’re more often getting closer to the generating position, though still frequently landing off our mark.

We can move up again in the number of characters, this time to k (our number of characters) equal to 10.

k<-10
par(mfrow=c(3,3))
for(i in 1:9){
  tree<-pbtree(n=13,tip.label=LETTERS[1:13])
  X<-sim.corrs(tree,vcv=genPositiveDefMat(k,
    covMethod="unifcorrmat")$Sigma)
  test<-drop.tip(tree,tip<-sample(tree$tip.label,1))
  ml<-locate.yeti(test,X,quiet=TRUE)
  tmp<-cophylo(tree,ml,rotate=FALSE)
  link.lwd<-rep(1,nrow(tmp$assoc))
  link.lwd[which(tmp$assoc[,1]==tip)]<-2
  link.col<-rep("blue",nrow(tmp$assoc))
  link.col[which(tmp$assoc[,1]==tip)]<-"red"
  plot(tmp,link.lwd=link.lwd,link.col=link.col,
    link.type="curved",fsize=0.8,ylim=c(0,1.05),
    link.lty="solid")
  if(i==1) legend("topleft","missing\ntaxon",
    lwd=2,lty="solid",col="red",bty="n",cex=0.8)
}

plot of chunk unnamed-chunk-7

OK, in this example we’ve hit some right on the head – but we also see that sometimes our missing taxon will attach below the root in the ML solution, which is interesting. (I recall observing that this sometimes happened back when we introduced the method. The likelihood surface can get very flat towards the root of the tree.)

Here’s k = 20.

k<-20
par(mfrow=c(3,3))
for(i in 1:9){
  tree<-pbtree(n=13,tip.label=LETTERS[1:13])
  X<-sim.corrs(tree,vcv=genPositiveDefMat(k,
    covMethod="unifcorrmat")$Sigma)
  test<-drop.tip(tree,tip<-sample(tree$tip.label,1))
  ml<-locate.yeti(test,X,quiet=TRUE)
  tmp<-cophylo(tree,ml,rotate=FALSE)
  link.lwd<-rep(1,nrow(tmp$assoc))
  link.lwd[which(tmp$assoc[,1]==tip)]<-2
  link.col<-rep("blue",nrow(tmp$assoc))
  link.col[which(tmp$assoc[,1]==tip)]<-"red"
  plot(tmp,link.lwd=link.lwd,link.col=link.col,
    link.type="curved",fsize=0.8,ylim=c(0,1.05),
    link.lty="solid")
  if(i==1) legend("topleft","missing\ntaxon",
    lwd=2,lty="solid",col="red",bty="n",cex=0.8)
}

plot of chunk unnamed-chunk-8

Finally, here’s k = 100.

k<-100
par(mfrow=c(3,3))
for(i in 1:9){
  tree<-pbtree(n=13,tip.label=LETTERS[1:13])
  X<-sim.corrs(tree,vcv=genPositiveDefMat(k,
    covMethod="unifcorrmat")$Sigma)
  test<-drop.tip(tree,tip<-sample(tree$tip.label,1))
  ml<-locate.yeti(test,X,quiet=TRUE)
  tmp<-cophylo(tree,ml,rotate=FALSE)
  link.lwd<-rep(1,nrow(tmp$assoc))
  link.lwd[which(tmp$assoc[,1]==tip)]<-2
  link.col<-rep("blue",nrow(tmp$assoc))
  link.col[which(tmp$assoc[,1]==tip)]<-"red"
  plot(tmp,link.lwd=link.lwd,link.col=link.col,
    link.type="curved",fsize=0.8,ylim=c(0,1.05),
    link.lty="solid")
  if(i==1) legend("topleft","missing\ntaxon",
    lwd=2,lty="solid",col="red",bty="n",cex=0.8)
}

plot of chunk unnamed-chunk-9

At this point, we’re getting almost every last one correct (keeping in mind that when sister taxa flip in the unrotated tanglegram, that is the correct topological location).

Lastly, as a related addendum, a couple of years ago I wrote a very simple R package called quantML that could do REML estimation of phylogeny from quantitative traits.

It’s very simple implementation, in that it doesn’t even take into account correlations between traits.

Nonetheless, let’s see how well it could do on estimating the whole phylogeny from a set of 100 quantitative traits. Just to emphasize, here we’re no longer trying to estimate the position of a missing taxon, but of the entire phylogenetic tree topology & its branch lengths.

library(quantML)
start<-phangorn::NJ(dist(X))
qml_tree<-quantml(start,X)
qml_tree
## object of class "quantml"

To see how we’ve done, let’s graph a tanglegram, much as we did earlier. (I’m midpoint-rooting my estimated tree using phangorn, but it would be better if we had included an outgroup.)

tmp<-cophylo(tree,phangorn::midpoint(qml_tree$tree))
## Rotating nodes to optimize matching...
## Done.
plot(tmp)

plot of chunk unnamed-chunk-12

Neat. I bet these trees are quite similar & that you’ll find the same if you replicate my experiment using different data.

I guess the meta point here is that there can be a lot of information about phylogenetic relationships in quantitative trait data that arises under Brownian motion!

That’s it folks!

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.