Monday, September 8, 2025

Ancestral states for a discrete trait when some nodes are known

Following the Evolution meeting this summer, a grad student reached out to me with a question. Unfortunately, it’s taken me longer than I would’ve hoped to respond to him – but he followed up recently, so I thought I’d give it a quick stab.

The question was as follows:

We met at the Evolution conference last week, because I mentioned having an issue with ancestral reconstruction in my talk. You suggested that I email you after the conference. In short, I want to perform a categorical reconstruction using multiple internal nodes having fixed values.

This is straightforward enough. Unfortunately, there is no “black-box” way to do this – we’re gonna need to code it ourselves.

The trick is that to fix one or any number of internal nodes all we need to do is attach a zero length tip to that node and then assign the known state.

To see how to go about doing that, I’m going to need some data first!

## load phytools
library(phytools)
## simulate a small tree
tree<-pbtree(n=26,tip.label=LETTERS,scale=1)
## set a value of Q
Q<-matrix(c(-1,1,0,1,-2,1,0,1,-1),3,3,
  dimnames=list(letters[1:3],letters[1:3]))
## simulate data
xx<-sim.Mk(tree,Q,internal=TRUE)
xx
##  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z 27 28 29 30 31 32 33 34 35 36 37 
##  a  b  b  c  c  c  b  c  c  c  c  c  c  c  b  b  b  b  c  c  c  c  c  c  c  b  b  b  b  b  b  c  c  c  c  c  c 
## 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
##  c  c  c  c  c  c  b  b  b  c  c  c  c  c 
## Levels: a b c

Here’s our observed data and all of the known states at internal states graphed onto the tree.

plotTree(tree,offset=1,direction="upwards",
  lwd=1)
pp<-get("last_plot.phylo",envir=.PlotPhyloEnv)
cols<-setNames(palette.colors(3,"Polychrome 36"),
  letters[1:3])
points(pp$xx,pp$yy,pch=16,col=cols[xx],cex=1.5)

plot of chunk unnamed-chunk-97

Our data include the states at all nodes. This is an unrealistic approximation of our scenario in which it’s more likely that we have a sample of nodes. Since our tree contains 25 internal nodes, let’s imagine that we have values for just 5 of these, as follows.

nn<-xx[sort(sample(1:tree$Nnode+Ntip(tree),10))]
nn
## 29 32 34 35 37 38 42 44 48 49 
##  b  c  c  c  c  c  c  b  c  c 
## Levels: a b c

These data are labeled according to the node indices of our "phylo" object, as follows.

plotTree(tree,offset=1,direction="upwards",
  lwd=1)
nodelabels(cex=0.6,bg="white")

plot of chunk unnamed-chunk-99

Our tip data, on the other hand, can be subsampled to only the species in our tree.

x<-xx[tree$tip.label]
x
## A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 
## a b b c c c b c c c c c c c b b b b c c c c c c c b 
## Levels: a b c

For our next step, we’re going to go through and add zero length tips to each of the internal nodes from nn. The only problem is that our node indices change with every new tip that we add! The trick here is to use phytools::matchNodes to update the node we’re adding the tip to after every new taxon is added.

nntree<-tree
for(i in 1:length(nn)){
  M<-matchNodes(tree,nntree,method="distances")
  nntree<-bind.tip(nntree,names(nn)[i],edge.length=0,
    where=M[which(M[,1]==as.numeric(names(nn)[i])),2])
}

Let’s check our result.

par(mfrow=c(1,2))
plotTree(tree,fsize=0.8,lwd=1)
nodelabels(cex=0.6)
plotTree(nntree,fsize=0.8,lwd=1)

plot of chunk unnamed-chunk-102

Really, that looks exactly right.

Next, we can fit our desired Mk model, and we should do so to the full tree & internal states – since, genuinely, known internal states provide us with information about the evolutionary process of our trait. I’m going to fit a custom model that matches our known generating process for the trait.

MODEL<-matrix(c(0,1,0,1,0,1,0,1,0),3,3)
mk_fit<-fitMk(nntree,c(x,nn),model=MODEL)
mk_fit
## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##           a         b         c
## a -0.513978  0.513978  0.000000
## b  0.513978 -1.027956  0.513978
## c  0.000000  0.513978 -0.513978
## 
## Fitted (or set) value of pi:
##        a        b        c 
## 0.333333 0.333333 0.333333 
## due to treating the root prior as (a) flat.
## 
## Log-likelihood: -15.088024 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

Now to get our ancestral states, we just need to pass this fitted model object to phytools::ancr.

anc_mk<-ancr(mk_fit)
anc_mk
## Marginal ancestral state estimates:
##           a        b        c
## 37 0.150557 0.629894 0.219549
## 38 0.019778 0.705542 0.274680
## 39 0.000000 1.000000 0.000000
## 40 0.000199 0.953008 0.046794
## 41 0.002164 0.981934 0.015902
## 42 0.000000 0.000000 1.000000
## ...
## 
## Log-likelihood = -15.088024

We can plot our results.

plot(anc_mk,
  args.plotTree=list(direction="upwards"),
  args.tiplabels=list(cex=0.4))

plot of chunk unnamed-chunk-105

For fun, let’s compare this result to the true values and to those that we would’ve obtained without the known internal states.

par(mfrow=c(1,3))
plotTree(tree,offset=0.5,lwd=1,ftype="i",
  mar=c(0.1,1.1,2.1,0.1))
pp<-get("last_plot.phylo",envir=.PlotPhyloEnv)
cols<-setNames(palette.colors(3,"Polychrome 36"),
  letters[1:3])
points(pp$xx,pp$yy,pch=16,col=cols[xx],cex=1.5)
mtext("a) true states",line=0,adj=0,cex=0.7)
plot(anc_mk,
  args.plotTree=list(mar=c(0.1,1.1,2.1,0.1),fsize=0.8),
  args.tiplabels=list(cex=0.8),
  args.nodelabels=list(cex=0.8),
  legend=FALSE)
mtext("b) estimates with some known nodes",line=0,adj=0,
  cex=0.7)
plot(ancr(fitMk(tree,x,model=MODEL)),
  args.plotTree=list(mar=c(0.1,1.1,2.1,0.1),fsize=0.8),
  args.tiplabels=list(cex=0.8),
  args.nodelabels=list(cex=0.8),
  legend=FALSE)
mtext("c) estimates based only on tip data",line=0,adj=0,
  cex=0.7)

plot of chunk unnamed-chunk-106

In general, we should find that our internal nodes are more accurate, and more confident – particularly when close to known nodes of the tree – than when only tip data is used.

That’s pretty much it!

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.