Following the Evolution meeting this summer, a grad student reached out to me with a question. Unfortunately, it’s taken me longer than I would’ve hoped to respond to him – but he followed up recently, so I thought I’d give it a quick stab.
The question was as follows:
We met at the Evolution conference last week, because I mentioned having an issue with ancestral reconstruction in my talk. You suggested that I email you after the conference. In short, I want to perform a categorical reconstruction using multiple internal nodes having fixed values.
This is straightforward enough. Unfortunately, there is no “black-box” way to do this – we’re gonna need to code it ourselves.
The trick is that to fix one or any number of internal nodes all we need to do is attach a zero length tip to that node and then assign the known state.
To see how to go about doing that, I’m going to need some data first!
## load phytools
library(phytools)
## simulate a small tree
tree<-pbtree(n=26,tip.label=LETTERS,scale=1)
## set a value of Q
Q<-matrix(c(-1,1,0,1,-2,1,0,1,-1),3,3,
dimnames=list(letters[1:3],letters[1:3]))
## simulate data
xx<-sim.Mk(tree,Q,internal=TRUE)
xx
## A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 27 28 29 30 31 32 33 34 35 36 37
## a b b c c c b c c c c c c c b b b b c c c c c c c b b b b b b c c c c c c
## 38 39 40 41 42 43 44 45 46 47 48 49 50 51
## c c c c c c b b b c c c c c
## Levels: a b c
Here’s our observed data and all of the known states at internal states graphed onto the tree.
plotTree(tree,offset=1,direction="upwards",
lwd=1)
pp<-get("last_plot.phylo",envir=.PlotPhyloEnv)
cols<-setNames(palette.colors(3,"Polychrome 36"),
letters[1:3])
points(pp$xx,pp$yy,pch=16,col=cols[xx],cex=1.5)
Our data include the states at all nodes. This is an unrealistic approximation of our scenario in which it’s more likely that we have a sample of nodes. Since our tree contains 25 internal nodes, let’s imagine that we have values for just 5 of these, as follows.
nn<-xx[sort(sample(1:tree$Nnode+Ntip(tree),10))]
nn
## 29 32 34 35 37 38 42 44 48 49
## b c c c c c c b c c
## Levels: a b c
These data are labeled according to the node indices of our "phylo"
object, as follows.
plotTree(tree,offset=1,direction="upwards",
lwd=1)
nodelabels(cex=0.6,bg="white")
Our tip data, on the other hand, can be subsampled to only the species in our tree.
x<-xx[tree$tip.label]
x
## A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
## a b b c c c b c c c c c c c b b b b c c c c c c c b
## Levels: a b c
For our next step, we’re going to go through and add zero length tips to each of the internal nodes from nn
. The only problem is that our node indices change with every new tip that we add! The trick here is to use phytools::matchNodes
to update the node we’re adding the tip to after every new taxon is added.
nntree<-tree
for(i in 1:length(nn)){
M<-matchNodes(tree,nntree,method="distances")
nntree<-bind.tip(nntree,names(nn)[i],edge.length=0,
where=M[which(M[,1]==as.numeric(names(nn)[i])),2])
}
Let’s check our result.
par(mfrow=c(1,2))
plotTree(tree,fsize=0.8,lwd=1)
nodelabels(cex=0.6)
plotTree(nntree,fsize=0.8,lwd=1)
Really, that looks exactly right.
Next, we can fit our desired Mk model, and we should do so to the full tree & internal states – since, genuinely, known internal states provide us with information about the evolutionary process of our trait. I’m going to fit a custom model that matches our known generating process for the trait.
MODEL<-matrix(c(0,1,0,1,0,1,0,1,0),3,3)
mk_fit<-fitMk(nntree,c(x,nn),model=MODEL)
mk_fit
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## a b c
## a -0.513978 0.513978 0.000000
## b 0.513978 -1.027956 0.513978
## c 0.000000 0.513978 -0.513978
##
## Fitted (or set) value of pi:
## a b c
## 0.333333 0.333333 0.333333
## due to treating the root prior as (a) flat.
##
## Log-likelihood: -15.088024
##
## Optimization method used was "nlminb"
##
## R thinks it has found the ML solution.
Now to get our ancestral states, we just need to pass this fitted model object to phytools::ancr
.
anc_mk<-ancr(mk_fit)
anc_mk
## Marginal ancestral state estimates:
## a b c
## 37 0.150557 0.629894 0.219549
## 38 0.019778 0.705542 0.274680
## 39 0.000000 1.000000 0.000000
## 40 0.000199 0.953008 0.046794
## 41 0.002164 0.981934 0.015902
## 42 0.000000 0.000000 1.000000
## ...
##
## Log-likelihood = -15.088024
We can plot our results.
plot(anc_mk,
args.plotTree=list(direction="upwards"),
args.tiplabels=list(cex=0.4))
For fun, let’s compare this result to the true values and to those that we would’ve obtained without the known internal states.
par(mfrow=c(1,3))
plotTree(tree,offset=0.5,lwd=1,ftype="i",
mar=c(0.1,1.1,2.1,0.1))
pp<-get("last_plot.phylo",envir=.PlotPhyloEnv)
cols<-setNames(palette.colors(3,"Polychrome 36"),
letters[1:3])
points(pp$xx,pp$yy,pch=16,col=cols[xx],cex=1.5)
mtext("a) true states",line=0,adj=0,cex=0.7)
plot(anc_mk,
args.plotTree=list(mar=c(0.1,1.1,2.1,0.1),fsize=0.8),
args.tiplabels=list(cex=0.8),
args.nodelabels=list(cex=0.8),
legend=FALSE)
mtext("b) estimates with some known nodes",line=0,adj=0,
cex=0.7)
plot(ancr(fitMk(tree,x,model=MODEL)),
args.plotTree=list(mar=c(0.1,1.1,2.1,0.1),fsize=0.8),
args.tiplabels=list(cex=0.8),
args.nodelabels=list(cex=0.8),
legend=FALSE)
mtext("c) estimates based only on tip data",line=0,adj=0,
cex=0.7)
In general, we should find that our internal nodes are more accurate, and more confident – particularly when close to known nodes of the tree – than when only tip data is used.
That’s pretty much it!