Friday, February 10, 2023

Model-comparison, ancestral state reconstruction, and stochastic mapping with fitpolyMk

Since 2019 phytools has contained a function (fitpolyMk) for fitting a polymorphic trait evolution model. This is a relatively simple model in which the polymorphic condition (e.g., a + b) is intermediate between the two, corresponding monomorphic states. fitpolyMk has no problem with higher degrees of polymorphism in which, for example, the condition a + b + c might be intermediate between a + b and b + c.

I’ve presented on this model several times and I get lots of questions about it, so I thought it might be worthwhile to illustrate some ways in which the model & function can be used.

For this example I’m going to use simulated data for an imaginary clade of lizards in which individual species might be ground-dwelling, bush-dwelling, tree-dwelling, and perhaps various combinations of these three fundamental states. It might be the case that evolutionary transitions occur ground ↔ bush ↔ tree (with all the corresponding polymorphic conditions in between) – but we’re going to imagine that we don’t know this a priori.

At the end I’ll reveal how the data were, in fact, simulated.

Note that to fully follow along with this post (as of the time of writing) you’ll need to update phytools from GitHub. This can be done with (for example) using the remotes CRAN package.

packageVersion("phytools")
## [1] '1.4.1'
library(phytools)

So, here’s our data.

plotTree(lizard.tree,ftype="off",lwd=1,type="fan")
X<-strsplit(setNames(as.character(habs),names(habs)),"+",
	fixed=TRUE)
pies<-matrix(0,Ntip(lizard.tree),3,
	dimnames=list(lizard.tree$tip.label,
	c("ground","bush","tree")))
for(i in 1:Ntip(lizard.tree)) 
	pies[lizard.tree$tip.label[i],
		X[[lizard.tree$tip.label[i]]]]<-
		rep(1/length(X[[lizard.tree$tip.label[i]]]),
		length(X[[lizard.tree$tip.label[i]]]))
cols<-c("gold1","olivedrab3","darkgreen")
par(fg="transparent")
tiplabels(pie=pies,piecol=cols,cex=0.3)
par(fg="black")
legend(x="topleft",legend=c("ground","bush","tree"),
	pt.cex=2,pch=16,col=cols)

plot of chunk unnamed-chunk-2

(Someone who’s not colorblind can tell me whether or not that’s a good palette.)

We can see from this plot that each tip of the tree is in the conditions ground, bush, tree, or some combination of states.

To start, let’s examine the endeavor of picking the best model.

When we do this, we can think of two broad flavors of polymorphic trait evolution model: one in which the monomorphic conditions have some kind of implied order; and a second in which they do not.

A good example of an ordered polymorphic character might be a meristic trait: such at the number of scales at the midline or the number of caudal vertebrae.

As a small sidebar, it’s probably worth noting that the most general unordered polymorphic trait evolution model (that is, model="ARD") is guaranteed to have a higher likelihood than all ordered models, because it has them as a special case in which the transition rates between certain states or state combinations are zero.

Here, I’ll fit just three variants of the unordered polymorphic trait evolution: "ER" (all transition rates equal); "transient"; and "ARD".

unordered.er<-fitpolyMk(lizard.tree,habs,model="ER")
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##                  bush ground tree bush+ground bush+tree ground+tree bush+ground+tree
## bush                0      0    0           1         1           0                0
## ground              0      0    0           1         0           1                0
## tree                0      0    0           0         1           1                0
## bush+ground         1      1    0           0         0           0                1
## bush+tree           1      0    1           0         0           0                1
## ground+tree         0      1    1           0         0           0                1
## bush+ground+tree    0      0    0           1         1           1                0
unordered.transient<-fitpolyMk(lizard.tree,habs,
	model="transient")
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##                  bush ground tree bush+ground bush+tree ground+tree bush+ground+tree
## bush                0      0    0           2         2           0                0
## ground              0      0    0           2         0           2                0
## tree                0      0    0           0         2           2                0
## bush+ground         1      1    0           0         0           0                2
## bush+tree           1      0    1           0         0           0                2
## ground+tree         0      1    1           0         0           0                2
## bush+ground+tree    0      0    0           1         1           1                0
unordered.ard<-fitpolyMk(lizard.tree,habs,model="ARD")
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##                  bush ground tree bush+ground bush+tree ground+tree bush+ground+tree
## bush                0      0    0           1         3           0                0
## ground              0      0    0           5         0           7                0
## tree                0      0    0           0         9          11                0
## bush+ground         2      6    0           0         0           0               13
## bush+tree           4      0   10           0         0           0               15
## ground+tree         0      8   12           0         0           0               17
## bush+ground+tree    0      0    0          14        16          18                0
layout(matrix(c(1,1,2,2,4,3,3,4),4,2))
plot(unordered.er)
mtext(paste("a) unordered ER, AIC:",
	round(AIC(unordered.er),2)),adj=0)
plot(unordered.transient)
mtext(paste("b) unordered transient, AIC:",
	round(AIC(unordered.transient),2)),adj=0)
plot(unordered.ard)
mtext(paste("c) unordered ARD, AIC:",
	round(AIC(unordered.ard),2)),adj=0)

plot of chunk unnamed-chunk-3

It’s evident which model is best-supported by our data among these three, but we can also compare them using a new anova method for the model class as follows.

anova(unordered.er,unordered.transient,unordered.ard)
##                        log(L) d.f.      AIC       weight
## unordered.er        -265.1739    1 532.3479 1.106000e-09
## unordered.transient -243.6807    2 491.3613 8.787669e-01
## unordered.ard       -229.6615   18 495.3229 1.212331e-01

This tells us fairly definitively that among the three models we’ve evaluated so far, the so-called “transient” model (featuring one rate for the acquisition of polymorphism and a second rate for its loss) is the best-supported by our data.

The other major flavor of polymorphic trait evolution that we’ll consider is the ordered model. We can specify this model class by setting the fitpolyMk argument ordered=TRUE as follows.

ordered.er<-fitpolyMk(lizard.tree,habs,model="ER",ordered=TRUE)
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##                  bush bush+ground bush+ground+tree ground ground+tree tree
## bush                0           1                0      0           0    0
## bush+ground         1           0                1      1           0    0
## bush+ground+tree    0           1                0      0           1    0
## ground              0           1                0      0           1    0
## ground+tree         0           0                1      1           0    1
## tree                0           0                0      0           1    0
ordered.er
## Object of class "fitpolyMk".
## 
## Evolution was modeled as 'ordered' (i.e., transitions are assumed
## to occur bush <-> bush+ground <-> bush+ground+tree and so on) using the "ER" model.
## 
## Fitted (or set) value of Q:
##                  bush bush+ground bush+ground+tree ground ground+tree tree
## bush              NaN         NaN                0      0           0    0
## bush+ground       NaN         NaN              NaN    NaN           0    0
## bush+ground+tree    0         NaN              NaN      0         NaN    0
## ground              0         NaN                0    NaN         NaN    0
## ground+tree         0           0              NaN    NaN         NaN  NaN
## tree                0           0                0      0         NaN  NaN
## 
## Fitted (or set) value of pi:
##             bush      bush+ground bush+ground+tree           ground      ground+tree 
##         0.166667         0.166667         0.166667         0.166667         0.166667 
##             tree 
##         0.166667 
## 
## Log-likelihood: -1e+50 
## 
## Optimization method used was "nlminb"

OK, something’s out-of-place here. When we take a closer at both our fitted model and our data we see the following mismatch.

ordered.er$states
## [1] "bush"             "bush+ground"      "bush+ground+tree" "ground"          
## [5] "ground+tree"      "tree"
levels(habs)
## [1] "bush"             "bush+tree"        "ground"           "ground+bush"     
## [5] "ground+bush+tree" "tree"

The default ordering of our monomorphic condition has created a set of polymorphic conditions that is different from the set of conditions that we observe in our data!

Is there an ordered process that would produce the conditions that we observe? Yes, in fact, and it is the one we hypothesized at the beginning of this article: ground ↔ bush ↔ tree (with all the polymorphisms in between, of course).

This specific ordering hypothesis can be specified using the fitpolyMk argument order (as opposed to ordered, which is a logical value), as follows.

levs<-c("ground","bush","tree")
ordered.er<-fitpolyMk(lizard.tree,habs,model="ER",
	ordered=TRUE,order=levs)
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##                  ground bush+ground bush+ground+tree bush bush+tree tree
## ground                0           1                0    0         0    0
## bush+ground           1           0                1    1         0    0
## bush+ground+tree      0           1                0    0         1    0
## bush                  0           1                0    0         1    0
## bush+tree             0           0                1    1         0    1
## tree                  0           0                0    0         1    0
ordered.transient<-fitpolyMk(lizard.tree,habs,
	model="transient",ordered=TRUE,order=levs)
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##                  ground bush+ground bush+ground+tree bush bush+tree tree
## ground                0           2                0    0         0    0
## bush+ground           1           0                2    1         0    0
## bush+ground+tree      0           1                0    0         1    0
## bush                  0           2                0    0         2    0
## bush+tree             0           0                2    1         0    1
## tree                  0           0                0    0         2    0
ordered.ard<-fitpolyMk(lizard.tree,habs,model="ARD",
	ordered=TRUE,order=levs)
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##                  ground bush+ground bush+ground+tree bush bush+tree tree
## ground                0           1                0    0         0    0
## bush+ground           2           0                3    5         0    0
## bush+ground+tree      0           4                0    0         7    0
## bush                  0           6                0    0         9    0
## bush+tree             0           0                8   10         0   11
## tree                  0           0                0    0        12    0

Let’s graph these three fitted models as we did before.

layout(matrix(c(1,1,2,2,4,3,3,4),4,2))
plot(ordered.er)
mtext(paste("a) ordered ER, AIC:",
	round(AIC(ordered.er),2)),adj=0)
plot(ordered.transient)
mtext(paste("b) ordered transient, AIC:",
	round(AIC(ordered.transient),2)),adj=0)
plot(ordered.ard)
mtext(paste("c) ordered ARD, AIC:",
	round(AIC(ordered.ard),2)),adj=0)

plot of chunk unnamed-chunk-8

As before, we can also compare all of these models using an anova call; however, in this case we’ll also include our first set of three unordered models, as follows.

anova(ordered.er,unordered.er,
	ordered.transient,unordered.transient,
	ordered.ard,unordered.ard)
##                        log(L) d.f.      AIC       weight
## ordered.er          -248.6455    1 499.2910 7.034847e-07
## unordered.er        -265.1739    1 532.3479 4.667033e-14
## ordered.transient   -233.4797    2 470.9594 9.985836e-01
## unordered.transient -243.6807    2 491.3613 3.708169e-05
## ordered.ard         -230.0686   12 484.1373 1.373518e-03
## unordered.ard       -229.6615   18 495.3229 5.115723e-06

With the entire set of models in hand, it’s now evident that the best-supported model in the six is clearly the ordered transient model.

For a next step, let’s try to use our fitted model to undertake ancestral state reconstruction.

For this, I prefer the CRAN package corHMM; however, to use corHMM we’re going to first have to do some bookkeeping with our data and fitted model.

To see this, let’s first load the package.

library(corHMM)

Next, I’m going to pull my states off the fitted model object. It’s important that I pull them from there because this ensures that (say) a+b and b+a have been synonymized (as they should be).

xx<-apply(ordered.transient$data,1,
	function(x,ss) ss[which(x==1)],
	ss=colnames(ordered.transient$data))

I also discovered that corHMM does not like the + character in state names. No problem. We’ll use gsub to subsitute /.

xx<-gsub("+","/",xx,fixed=TRUE)

Now let’s build the data frame that corHMM needs as input.

lizard.data<-data.frame(Genus_sp=names(xx),
	habitat=xx)
head(lizard.data)
##      Genus_sp habitat
## t220     t220    tree
## t221     t221    tree
## t199     t199    tree
## t36       t36    tree
## t37       t37    tree
## t81       t81    tree

So far so good. Next, I’ll pull out my model design matrix from the fitted object. Here, I need to replace all the zeroes with NA, rename the rows & columns substituting / for + as I did in my state data, and reorder alphabetically.

rate.mat<-ordered.transient$index.matrix
rate.mat[rate.mat==0]<-NA
colnames(rate.mat)<-rownames(rate.mat)<-gsub("+","/",
	colnames(rate.mat),fixed=TRUE)
ind<-order(colnames(rate.mat))
rate.mat<-rate.mat[ind,ind]
rate.mat
##                  bush bush/ground bush/ground/tree bush/tree ground tree
## bush               NA           2               NA         2     NA   NA
## bush/ground         1          NA                2        NA      1   NA
## bush/ground/tree   NA           1               NA         1     NA   NA
## bush/tree           1          NA                2        NA     NA    1
## ground             NA           2               NA        NA     NA   NA
## tree               NA          NA               NA         2     NA   NA

Finally, I’m ready to get my ancestral states!

fit.marginal<-corHMM(lizard.tree,lizard.data,
	rate.mat=rate.mat,
	node.states="marginal",
	rate.cat=1,p=ordered.transient$rates,
	root.p=ordered.transient$pi)
## State distribution in data:
## States:	1	2	3	4	5	6	
## Counts:	91	33	5	36	14	121	
## Calculating likelihood from a set of fixed parameters 
## Finished. Inferring ancestral states using marginal reconstruction.
fit.marginal
## 
## Fit
##       -lnL      AIC     AICc Rate.cat ntax
##  -233.4797 470.9594 470.9998        1  300
## 
## Legend
##                  1                  2                  3                  4 
##             "bush"      "bush/ground" "bush/ground/tree"        "bush/tree" 
##                  5                  6 
##           "ground"             "tree" 
## 
## Rates
##          (1,R1)    (2,R1)    (3,R1)    (4,R1)   (5,R1)   (6,R1)
## (1,R1)       NA 0.6716599        NA 0.6716599       NA       NA
## (2,R1) 2.056186        NA 0.6716599        NA 2.056186       NA
## (3,R1)       NA 2.0561858        NA 2.0561858       NA       NA
## (4,R1) 2.056186        NA 0.6716599        NA       NA 2.056186
## (5,R1)       NA 0.6716599        NA        NA       NA       NA
## (6,R1)       NA        NA        NA 0.6716599       NA       NA
## 
## Arrived at a reliable solution

I should see that the likelihood matches what I obtained from fitpolyMk, which is a good sign.

The marginal states are stored in fit.marginal$states matrix. Let’s pull that out and do some our previous bookkeeping again, but this time in reverse.

asr<-fit.marginal$states
colnames(asr)<-colnames(rate.mat)
colnames(asr)<-gsub("/","+",colnames(asr))
head(asr)
##             bush  bush+ground bush+ground+tree  bush+tree       ground       tree
## [1,] 0.173153538 9.473746e-03     0.1274649528 0.67980046 1.212651e-05 0.01009518
## [2,] 0.258401494 9.076941e-03     0.0865099912 0.61484627 1.184951e-05 0.03115345
## [3,] 0.279675949 8.692735e-03     0.0607896743 0.59017966 1.823457e-05 0.06064374
## [4,] 0.045162574 2.810016e-04     0.0055312345 0.37299114 2.453109e-07 0.57603380
## [5,] 0.004614478 1.572653e-05     0.0013535985 0.22013706 1.738418e-09 0.77387914
## [6,] 0.000263540 1.605630e-06     0.0002711173 0.04837474 4.856456e-10 0.95108900

Last of all, plot our reconstructed states on the tree.

plotTree(lizard.tree,ftype="off",lwd=1,type="fan")
X<-strsplit(setNames(as.character(habs),names(habs)),"+",
	fixed=TRUE)
pies<-matrix(0,Ntip(lizard.tree),3,
	dimnames=list(lizard.tree$tip.label,
	c("ground","bush","tree")))
for(i in 1:Ntip(lizard.tree)) 
	pies[lizard.tree$tip.label[i],
		X[[lizard.tree$tip.label[i]]]]<-
		rep(1/length(X[[lizard.tree$tip.label[i]]]),
		length(X[[lizard.tree$tip.label[i]]]))
cols<-setNames(c("gold1","olivedrab3","darkgreen"),
	c("ground","bush","tree"))
par(fg="transparent")
tiplabels(pie=pies,piecol=cols,cex=0.3)
par(fg="black")
piecol<-vector()
for(i in 1:ncol(asr)){
	nn<-strsplit(colnames(asr)[i],"+",fixed=TRUE)[[1]]
	if(length(nn)==1) piecol[i]<-cols[nn]
	else if(length(nn)==2) piecol[i]<-colorRampPalette(cols[nn])(3)[2]
	else piecol[i]<-"black"
}
names(piecol)<-colnames(asr)
par(fg="transparent")
nodelabels(pie=asr,piecol=piecol,cex=0.5)
par(fg="black")
legend(x="topleft",legend=colnames(asr),
	pt.cex=1.8,pch=16,cex=0.8,col=piecol,
	bty="n")

plot of chunk unnamed-chunk-17

Cool.

For our final analysis, I’ll show how to apply our fitted model to the method of stochastic character mapping. Since both fitpolyMk and make.simmap are both from phytools this is going to be pretty easy.

smap.trees<-make.simmap(lizard.tree,ordered.transient$data,
	Q=as.Qmatrix(ordered.transient),pi=ordered.transient$pi,
	nsim=100)
## make.simmap is sampling character histories conditioned on
## the transition matrix
## 
## Q =
##                      ground bush+ground bush+ground+tree      bush  bush+tree
## ground           -0.6716599   0.6716599        0.0000000  0.000000  0.0000000
## bush+ground       2.0561858  -4.7840314        0.6716599  2.056186  0.0000000
## bush+ground+tree  0.0000000   2.0561858       -4.1123715  0.000000  2.0561858
## bush              0.0000000   0.6716599        0.0000000 -1.343320  0.6716599
## bush+tree         0.0000000   0.0000000        0.6716599  2.056186 -4.7840314
## tree              0.0000000   0.0000000        0.0000000  0.000000  0.6716599
##                        tree
## ground            0.0000000
## bush+ground       0.0000000
## bush+ground+tree  0.0000000
## bush              0.0000000
## bush+tree         2.0561858
## tree             -0.6716599
## (specified by the user);
## and (mean) root node prior probabilities
## pi =
##           ground      bush+ground bush+ground+tree             bush        bush+tree 
##        0.1666667        0.1666667        0.1666667        0.1666667        0.1666667 
##             tree 
##        0.1666667
## Done.
smap.trees
## 100 phylogenetic trees with mapped discrete characters

For fun, let’s look at our whole set of stochastic mapped trees.

par(mfrow=c(10,10))
plot(smap.trees,ftype="off",lwd=1,colors=piecol,type="fan")

plot of chunk unnamed-chunk-19

That’s neat, but a very common endeavor with stochastic mapping is to compute the posterior probabilities at nodes. We can do that very easily and – since we used the same fitted ML model – we’ll obtain values that very closely resemble our marginal ancestral states.

par(fg="transparent")
plot(summary(smap.trees),colors=piecol,type="fan",ftype="off",
	lwd=1,cex=c(0.5,0.3))
par(fg="black")
legend(x="topleft",legend=colnames(asr),
	pt.cex=1.8,pch=16,cex=0.8,col=piecol,
	bty="n")

plot of chunk unnamed-chunk-20

Neat. That’s pretty cool.

As promised, the data were simulated as follows.

levs<-c("ground","ground+bush",
	"ground+bush+tree","bush",
	"bush+tree","tree")
Q<-matrix(c(
	0,0.5,0,0,0,0,
	2,0,0.5,2,0,0,
	0,2,0,0,2,0,
	0,0.5,0,0,0.5,0,
	0,0,0.5,2,0,2,
	0,0,0,0,0.5,0),6,6,byrow=TRUE,
	dimnames=list(levs,levs))
habs<-sim.Mk(lizard.tree<-pbtree(n=300,scale=1),Q=Q)

That’s all folks.

4 comments:

  1. Dear Liam,
    This tutorial came at a perfect time for a project I'm working on. Was nice to see the demonstration of the corHMM package as well. I wonder if you could help me with a related problem? Following ancestral state reconstruction, I'm hoping to get the estimated number of state transitions per branch. Could you recommend an approach to do this?
    Much appreciate your advice,
    Joe

    ReplyDelete
  2. Hello Joe,

    I'm not Dr. Revell and there may be a better way to do this, but I would do it like this:

    simmap.results <- make.simmap(...)
    simmap.results <- lapply(simmap.results,function(x){
    names(x$maps) <- rownames(x$mapped.edge)
    return(x)
    })
    transitions.per.branch <- sapply(simmap.results,function(x){
    sapply(x$maps,function(y){length(y-1)})
    })
    rowMeans(transitions.per.branch)

    Best,
    Jeff

    ReplyDelete
  3. Apologies to the moderator this but I noticed a typo when I was notified this was posted, so posting again. Should be length(y)-1, not length(y-1). So overall:

    simmap.results <- make.simmap(...)
    simmap.results <- lapply(simmap.results,function(x){
    names(x$maps) <- rownames(x$mapped.edge)
    return(x)
    })
    transitions.per.branch <- sapply(simmap.results,function(x){
    sapply(x$maps,function(y){length(y)-1})
    })
    rowMeans(transitions.per.branch)

    ReplyDelete
  4. Hi Jeff, thank for your help. I'll try this out.

    ReplyDelete

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.