Thursday, March 27, 2025

A "polymorphic" trait evolution model with a null or absent condition

A junior colleague recently contacted me with the following question (with some of the identifying details removed):

“Apologies for the cold email, and I do hope this is the right address for questions! …. Essentially, I am trying to calculate a polymorphic trait where “absence” is a possible value. Here’s a quick overview of my problem, but I can provide code or further explanation if that helps…. My data is discrete and polymorphic in the sense that a given tip can have four distinct states: (1) none of the relevant genes; (2) gene A; (3) gene B; (4) gene A and B together. I have tried to pass this to the fitpolyMk() function using a Factor with levels of c("-", "A", "B", "A+B" ), where "-" corresponds to strains that lack both gene A and gene B. This computes without error, but it generates a model with design matrix having rows and columns like: - A B -+A -+B A+B -+A+B. Biologically, A is the same thing as -+A. Both are just strains that have gene A and not gene B. Therefore, I’d like to call the function so that the models fit recognize an “empty” trait, and don’t try to compute nodes evolving from A+B into -+A+B. Does that make sense? Is it a complete misuse of the software? I’d be extremely grateful for any advice or feedback you could give!”

This actually makes perfect sense. I know why the user thought that fitpolyMk would be the right method for these data, and I understand the trouble that has arisen by include a “null” or “absent” level of the trait.

The solution is to design a custom model that captures the biological features of the hypothesized process for the trait, in which -A (and the reverse); -B (and the reverse); but to get from A to B one must either transition through - or through A+B, i.e., A-B and AA+BB (and the reverse).

This is what that design matrix might look like. Here, I’ll imagine that there is one rate of gene gain, and another rate of gene loss – but we could also estimate separate rates for each type of change, or assume a single constant rate for all transition types.

Model<-matrix(
  c(
    0,1,0,1,
    2,0,1,0,
    0,2,0,2,
    2,0,2,0),4,4,byrow=TRUE,
  dimnames=list(
    c("-","A","A+B","B"),
    c("-","A","A+B","B"))
)
Model
##     - A A+B B
## -   0 1   0 1
## A   2 0   1 0
## A+B 0 2   0 2
## B   2 0   2 0

Note that the different integer values here are not rates – they are merely being employed to signal to R that we want to estimate different rates for our different transition types.

Now, let’s analyze some data I imagined up that have the same properties as those of our user.

## load packages
library(phytools)
## our tree
tree
## 
## Phylogenetic tree with 200 tips and 199 internal nodes.
## 
## Tip labels:
##   t171, t172, t82, t2, t165, t166, ...
## 
## Rooted; includes branch length(s).
## our data
head(x)
## t171 t172  t82   t2 t165 t166 
##    -    -    -    A    A    A 
## Levels: - A A+B B
summary(x)
##   -   A A+B   B 
##  65  42  19  74

We can right away proceed to fit our model as follows:

## this step is just good housekeeping
X<-to.matrix(x,rownames(Model))
head(X)
##      - A A+B B
## t171 1 0   0 0
## t172 1 0   0 0
## t82  1 0   0 0
## t2   0 1   0 0
## t165 0 1   0 0
## t166 0 1   0 0
## fit our model
fitted_mk_model<-fitMk(tree,X,model=Model,pi="fitzjohn")
fitted_mk_model
## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##             -         A       A+B         B
## -   -0.020637  0.010319  0.000000  0.010319
## A    0.009697 -0.020015  0.010319  0.000000
## A+B  0.000000  0.009697 -0.019394  0.009697
## B    0.009697  0.000000  0.009697 -0.019394
## 
## Fitted (or set) value of pi:
##        -        A      A+B        B 
## 0.727399 0.110186 0.021389 0.141026 
## due to treating the root prior as (a) nuisance.
## 
## Log-likelihood: -178.047471 
## 
## Optimization method used was "nlminb"
## 
## R thinks it has found the ML solution.

(In practice, we may want to run multiple optimization iterations with different starting conditions to ensure convergence to the ML solution.)

Finally, let’s plot our fitted model to make sure that it has the structure we desired.

plot(fitted_mk_model,show.zeros=FALSE,
  mar=rep(0.1,4),signif=5)

plot of chunk unnamed-chunk-64

Right on.

Just for fun, let’s create a visual of our tree & data as follows.

geneA<-setNames(rep("-",length(x)),
  names(x))
geneA[grep("A",x)]<-"A"
geneB<-setNames(rep("-",length(x)),
  names(x))
geneB[grep("B",x)]<-"B"
genes<-data.frame(A=geneA,B=geneB)
cols<-list(
  setNames(c("#F0EAD6","blue"),c("-","A")),
  setNames(c("#F0EAD6","red"),c("-","B")))
plotFanTree.wTraits(tree,genes,part=0.5,colors=cols,
  arc_height=2,ftype="off")
legend("topleft",c("gene A present","gene B present","gene absent"),
  col=c("blue","red","#F0EAD6"),pch=15,pt.cex=2,bty="n")
plot(fitted_mk_model,add=TRUE,ylim=c(-0.75,2.25),xlim=c(-1.5,1.5),
  show.zeros=FALSE,signif=4,cex.traits=0.7,offset=0.03)

plot of chunk unnamed-chunk-65

Lastly, these data are not those of the phytools user, they were simulated as follows.

library(phytools)
tree<-pbtree(n=200,scale=100)
Q<-matrix(
  c(
    0,0.01,0,0.01,
    0.01,0,0.01,0,
    0,0.01,0,0.01,
    0.01,0,0.01,0),4,4,byrow=TRUE,
  dimnames=list(
    c("-","A","A+B","B"),
    c("-","A","A+B","B"))
)
diag(Q)<--rowSums(Q)
x<-sim.Mk(tree,Q)

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.