Friday, May 31, 2024

Fitting an Mk model & reconstructing ancestral states when some levels of the character are unobserved among tip species of the tree

A colleague recently asked me about ancestral state reconstruction when some values of our discrete character should exist, but have not been observed.

An example of this might be a meristic (counted) character, in which elements are gained and lost sequentially. Perhaps we only observe species with 21, 22, 24, 26, 27, and 29 teeth in the upper-jaw (say), but it might make sense to model the evolution of this counted feature as if the unobserved levels of 23, 25, and 28 could exist, and might even have been observed in some of the ancestors of the taxa in our set.

To illustrate how we can go about fitting a discrete character (Mk) trait evolution model in which one or more states exist but are unobserved, I’m first going to need to simulate data that have this property!

To do that, I’ll generate data for an ordered character with levels 0 through 9, but in which the condition 1 & 3 are not seen among living taxa. The seed (1238, below) used in the simulation below was carefully chosen so that the data generated have this feature! Try a different seed & you’ll see why that had to be the case.

## load phytools
library(phytools)
## set seed
set.seed(1238)
## simulation tree
tree<-pbtree(n=100,scale=10)
## create ordered transition matrix
Q<-matrix(0,10,10,dimnames=list(0:9,0:9))
for(i in 2:nrow(Q)) Q[i-1,i]<-Q[i,i-1]<-1
diag(Q)<--rowSums(Q)
## simulate data
x<-sim.Mk(tree,Q)
head(x,20)
## t53 t54 t35 t78 t79 t66  t8 t24 t25 t10 t11 t20 t30 t31 t16 t17 t83 t84 t57 t58 
##   9   8   5   7   7   8   6   2   4   7   8   7   8   9   4   8   9   9   6   6 
## Levels: 0 2 4 5 6 7 8 9

We can see the frequencies of all the observed levels of our trait by running summary on this factor.

summary(x)
##  0  2  4  5  6  7  8  9 
##  3  6  6 13 10 22 16 24

Our first step towards fitting our model and reconstructing ancestral states will be to convert our factor vector into a binary matrix. This can be done using the function phytools::to.matrix as follows.

X<-to.matrix(x,seq=0:9)
head(X)
##     0 1 2 3 4 5 6 7 8 9
## t53 0 0 0 0 0 0 0 0 0 1
## t54 0 0 0 0 0 0 0 0 1 0
## t35 0 0 0 0 0 1 0 0 0 0
## t78 0 0 0 0 0 0 0 1 0 0
## t79 0 0 0 0 0 0 0 1 0 0
## t66 0 0 0 0 0 0 0 0 1 0

Note that we see columns for all 0 through 9 levels of our trait, not just for the values we observed in our tip species.

I’m going to apply colSums to this matrix to confirm that our frequencies match what we’d obtained from summary.

colSums(X)
##  0  1  2  3  4  5  6  7  8  9 
##  3  0  6  0  6 13 10 22 16 24

So far so good.

Next, we’ll construct a design matrix for the model that we intend to fit. In general, we should not expect to be able to estimate separate rates of evolution to & from character state levels not found in our observed data.

In this case, our model might be one in which the rate of change between all adjacent character levels is equal, but different in the backward & forward directions. (In fact, in our generating model, above, these rates were equal – but I thought it might be useful to illustrate this slightly more complex design matrix as it should also correspond to a model that is identifiable for this type of data.)

ordered_model<-matrix(0,10,10,
  dimnames=list(colnames(X),colnames(X)))
for(i in 2:nrow(ordered_model)){
  ordered_model[i-1,i]<-1
  ordered_model[i,i-1]<-2
}
ordered_model
##   0 1 2 3 4 5 6 7 8 9
## 0 0 1 0 0 0 0 0 0 0 0
## 1 2 0 1 0 0 0 0 0 0 0
## 2 0 2 0 1 0 0 0 0 0 0
## 3 0 0 2 0 1 0 0 0 0 0
## 4 0 0 0 2 0 1 0 0 0 0
## 5 0 0 0 0 2 0 1 0 0 0
## 6 0 0 0 0 0 2 0 1 0 0
## 7 0 0 0 0 0 0 2 0 1 0
## 8 0 0 0 0 0 0 0 2 0 1
## 9 0 0 0 0 0 0 0 0 2 0

Now we can proceed to fit the model in the standard way using phytools::fitMk.

fit_ordered<-fitMk(tree,X,model=ordered_model,
  pi="fitzjohn")
fit_ordered
## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##           0         1         2         3         4         5         6         7         8         9
## 0 -0.984536  0.984536  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## 1  0.907888 -1.892424  0.984536  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## 2  0.000000  0.907888 -1.892424  0.984536  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
## 3  0.000000  0.000000  0.907888 -1.892424  0.984536  0.000000  0.000000  0.000000  0.000000  0.000000
## 4  0.000000  0.000000  0.000000  0.907888 -1.892424  0.984536  0.000000  0.000000  0.000000  0.000000
## 5  0.000000  0.000000  0.000000  0.000000  0.907888 -1.892424  0.984536  0.000000  0.000000  0.000000
## 6  0.000000  0.000000  0.000000  0.000000  0.000000  0.907888 -1.892424  0.984536  0.000000  0.000000
## 7  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.907888 -1.892424  0.984536  0.000000
## 8  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.907888 -1.892424  0.984536
## 9  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.907888 -0.907888
## 
## Fitted (or set) value of pi:
##        0        1        2        3        4        5        6        7        8        9 
## 0.000003 0.000026 0.000290 0.002361 0.013231 0.049856 0.126197 0.220321 0.283876 0.303839 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -170.454534 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

Here’s a plot of our fitted model.

plot(fit_ordered,show.zeros=FALSE,mar=rep(0.1,4))

plot of chunk unnamed-chunk-10

Excellent. Now we can proceed straight away to run ancr on this fitted model object.

anc_ordered<-ancr(fit_ordered)
anc_ordered
## Marginal ancestral state estimates:
##         0       1        2        3        4        5        6        7        8        9
## 101 0e+00 0.0e+00 0.000000 0.000023 0.000729 0.010355 0.066347 0.202224 0.335722 0.384599
## 102 0e+00 0.0e+00 0.000003 0.000101 0.001773 0.016679 0.081737 0.212349 0.323857 0.363501
## 103 0e+00 1.0e-06 0.000015 0.000262 0.002929 0.020128 0.082360 0.201228 0.316120 0.376957
## 104 5e-06 3.2e-05 0.000302 0.002212 0.011505 0.041851 0.107343 0.200249 0.288364 0.348138
## 105 9e-06 4.9e-05 0.000387 0.002511 0.012045 0.041789 0.105283 0.197214 0.288288 0.352427
## 106 0e+00 0.0e+00 0.000000 0.000000 0.000006 0.000227 0.005401 0.066826 0.349219 0.578321
## ...
## 
## Log-likelihood = -170.454534

Here’s a plot of our marginal states using the "ancr" plot method. We can see that unobserved character levels 1 and 3 are indeed reconstructed for some of the internal nodes of the tree.

plot(anc_ordered,
  args.plotTree=list(type="arc",arc_height=0.4,ftype="off"),
  args.nodelabels=list(piecol=hcl.colors(n=10),cex=0.5),
  args.tiplabels=list(cex=0.3),
  legend=FALSE)
## calculate the spacing of the legend
obj<-legend(x="bottomleft",legend=0:9,
  pch=16,col=hcl.colors(n=10),horiz=TRUE,
  bty="n",cex=0.7,pt.cex=1.4,plot=FALSE)
legend(x=-0.5*obj$rect$w,y=0,
  legend=0:9,pch=16,col=hcl.colors(n=10),horiz=TRUE,
  bty="n",cex=0.7,pt.cex=1.4)

plot of chunk unnamed-chunk-12

(One trick that I used here to center the legend on the left-right plane was both first “plotting” the legend with plot=FALSE. This invisibly returns the coordinates and (importantly for us) the width of the plotted legend in user units. We can use this information to re-position the legend into a desirable spot when we get around to graphing it.)

I personally like the new phytools function plotFanTree.wTraits, so let’s use that instead.

plotFanTree.wTraits(tree,data.frame(factor(x,levels=0:9)),
  part=0.5,colors=list(setNames(hcl.colors(n=10),0:9)),
  ftype="off")
par(fg="transparent")
nodelabels(pie=anc_ordered$ace,piecol=hcl.colors(n=10),
  cex=0.5)
par(fg="black")
obj<-legend(x="bottomleft",legend=0:9,
  pch=16,col=hcl.colors(n=10),horiz=TRUE,
  bty="n",cex=0.7,pt.cex=1.4,plot=FALSE)
legend(x=-0.5*obj$rect$w,y=0,
  legend=0:9,pch=16,col=hcl.colors(n=10),horiz=TRUE,
  bty="n",cex=0.7,pt.cex=1.4)

plot of chunk unnamed-chunk-13

It’s even possible to “reconstruct” ancestral states using this pipeline if our data are totally invariant – i.e., if all species have the same observed condition.

This isn’t of great practical utility in phylogenetic comparative biology – but perhaps we want to iterate a workflow across a set of binary character, some of which exhibit both 0 and 1 levels, and others which are fixed for one or the other. (In fact, I can see how this might be useful!)

## our invariant trait
y<-setNames(rep(1,Ntip(tree)),tree$tip.label)
## convert to 0/1 matrix
Y<-to.matrix(y,0:1)
## fit model
fit_invariant<-fitMk(tree,Y,pi="fitzjohn")
fit_invariant
## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##   0 1
## 0 0 0
## 1 0 0
## 
## Fitted (or set) value of pi:
## 0 1 
## 0 1 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: 0 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.
## marginal ASR
anc_invariant<-ancr(fit_invariant)
anc_invariant
## Marginal ancestral state estimates:
##     0 1
## 101 0 1
## 102 0 1
## 103 0 1
## 104 0 1
## 105 0 1
## 106 0 1
## ...
## 
## Log-likelihood = 0
## plot our results
plot(anc_invariant,
  args.plotTree=list(direction="upwards"),
  args.nodelabels=list(piecol=hcl.colors(n=2)))

plot of chunk unnamed-chunk-16

See. It’s not hard.

That’s all folks!