Wednesday, December 14, 2022

Ancestral state reconstruction under the polymorphic trait evolution using stochastic mapping

I recently fielded a question about whether or not it was possible to use the polymorphic trait evolution model, fitpolyMk, to obtain posterior probabilities for internal nodes using the technique of stochastic character mapping.

The answer is yes, and, in fact, I posted about this way back in 2019. Nonetheless, I thought it might be worth reiterating today.

For this example I’ll use habitat data for Mycalesina butterflies from a study by Halali et al (2020).

The butterfly species occur in three distinct habitat types – forest, fringe, and open – but can also occur in more than one type, something that we’ll code as polymorphism. We’re also going to assume that evolution of this character is ordered – that is to say:

library(phytools)
graph.polyMk(k=3,states=c("forest","fringe","open"),
	model="ARD",ordered=TRUE)
title(main="\"Ordered\" polymorphic trait evolution model",
	font.main=3)

plot of chunk unnamed-chunk-1

These data feature in a problem set from my recent book with Luke Harmon, so they can be conveniently downloaded directly from the book website.

Let’s get ’em.

butterfly.tree<-read.nexus(file="http://www.phytools.org/Rbook/7/Mycalesina_phylogeny.nex")
butterfly.tree
## 
## Phylogenetic tree with 311 tips and 310 internal nodes.
## 
## Tip labels:
##   Uni_unica, Bra_peitho_gigas, Bra_decira, Bra_eliasis, Bra_simonsii, Bra_teratia, ...
## 
## Rooted; includes branch lengths.
butterfly.habitat<-read.csv(file="http://www.phytools.org/Rbook/7/Mycalesina_habitat.csv",
	row.names=1,stringsAsFactors=TRUE)
head(butterfly.habitat)
##                                     habitat
## Myc_francisca_formosana? forest+fringe+open
## Bic_cooksoni                           open
## Bic_brunnea                          forest
## Bic_jefferyi                    fringe+open
## Bic_auricruda_fulgida                forest
## Bic_smithi_smithi             forest+fringe

There are some differences between the taxa in the tree in the dataset. We can use the function geiger::name.check to resolve.

library(geiger)
chk<-name.check(butterfly.tree,butterfly.habitat)
summary(chk)
## 24 taxa are present in the tree but not the data:
##     Bic_pareensis,
##     Bra_eliasis,
##     Bra_teratia,
##     Cul_bisaya,
##     Eno_anthedon,
##     Eno_portlandia,
##     ....
## 
## To see complete list of mis-matched taxa, print object.
butterfly.tree<-drop.tip(butterfly.tree,
	chk$tree_not_data)
name.check(butterfly.tree,butterfly.habitat)
## [1] "OK"
habitat<-setNames(butterfly.habitat[[1]],
	rownames(butterfly.habitat))

Now let’s visualize our data. In this case, I will reproduce some previous code of my blog and show polymorphism at the tips of the tree using ‘pies.’

plotTree(butterfly.tree,ftype="off",lwd=1,type="fan")
X<-strsplit(setNames(as.character(habitat),names(habitat)),"+",
	fixed=TRUE)
pies<-matrix(0,Ntip(butterfly.tree),3,
	dimnames=list(butterfly.tree$tip.label,
	c("forest","fringe","open")))
for(i in 1:Ntip(butterfly.tree)) 
	pies[butterfly.tree$tip.label[i],
		X[[butterfly.tree$tip.label[i]]]]<-
		rep(1/length(X[[butterfly.tree$tip.label[i]]]),
		length(X[[butterfly.tree$tip.label[i]]]))
cols<-c(rgb(0,1,0),rgb(0,0,1),rgb(1,0,0))
par(fg="transparent")
tiplabels(pie=pies,piecol=cols,cex=0.3)
par(fg="black")
legend(x="topleft",legend=c("forest","fringe","open"),
	pt.cex=2,pch=16,col=cols)

plot of chunk unnamed-chunk-4

(OK, I don’t like these colors either, but using rgb is convenient for our purposes here for when we make ‘intermediate’ colors later!)

Next, we can fit our polymorphic trait evolution model using fitpolyMk.

Normally, we might first fit a set of models and compare them. Here, I will just use a standard all-rates-different ("ARD") ordered model as follows.

butterfly.ordered<-fitpolyMk(butterfly.tree,habitat,
	model="ARD",ordered=TRUE)
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##                    forest forest+fringe forest+fringe+open fringe fringe+open open
## forest                  0             1                  0      0           0    0
## forest+fringe           2             0                  3      5           0    0
## forest+fringe+open      0             4                  0      0           7    0
## fringe                  0             6                  0      0           9    0
## fringe+open             0             0                  8     10           0   11
## open                    0             0                  0      0          12    0
butterfly.ordered
## Object of class "fitpolyMk".
## 
## Evolution was modeled as 'ordered' (i.e., transitions are assumed
## to occur forest <-> forest+fringe <-> forest+fringe+open and so on) using the "ARD" model.
## 
## Fitted (or set) value of Q:
##                       forest forest+fringe forest+fringe+open    fringe fringe+open
## forest             -0.034759      0.034759           0.000000  0.000000    0.000000
## forest+fringe       0.110911     -0.208582           0.020277  0.077395    0.000000
## forest+fringe+open  0.000000      0.088086          -0.088086  0.000000    0.000000
## fringe              0.000000      0.495208           0.000000 -1.153636    0.658427
## fringe+open         0.000000      0.000000           0.033355  0.000000   -0.099319
## open                0.000000      0.000000           0.000000  0.000000    0.000000
##                        open
## forest             0.000000
## forest+fringe      0.000000
## forest+fringe+open 0.000000
## fringe             0.000000
## fringe+open        0.065964
## open               0.000000
## 
## Fitted (or set) value of pi:
##             forest      forest+fringe forest+fringe+open             fringe 
##           0.166667           0.166667           0.166667           0.166667 
##        fringe+open               open 
##           0.166667           0.166667 
## 
## Log-likelihood: -297.810038 
## 
## Optimization method used was "nlminb"

Let’s plot our fitted model.

plot(butterfly.ordered)
title(main=
	"Fitted polymorphic trait evolution model\nfor buttefly habitat use",
	font.main=3)

plot of chunk unnamed-chunk-6

Alright, now, believe it or not, we’re basically ready to do stochastic mapping.

To take this next step, we first just pull out our data and fitted model from the "fitpolyMk" object as follows.

Q<-as.Qmatrix(butterfly.ordered)
print(Q,digits=3)
## Estimated Q matrix:
##                     forest forest+fringe forest+fringe+open  fringe fringe+open  open
## forest             -0.0348        0.0348             0.0000  0.0000      0.0000 0.000
## forest+fringe       0.1109       -0.2086             0.0203  0.0774      0.0000 0.000
## forest+fringe+open  0.0000        0.0881            -0.0881  0.0000      0.0000 0.000
## fringe              0.0000        0.4952             0.0000 -1.1536      0.6584 0.000
## fringe+open         0.0000        0.0000             0.0334  0.0000     -0.0993 0.066
## open                0.0000        0.0000             0.0000  0.0000      0.0000 0.000
X<-butterfly.ordered$data
head(X)
##                          forest forest+fringe forest+fringe+open fringe fringe+open open
## Myc_francisca_formosana?      0             0                  1      0           0    0
## Bic_cooksoni                  0             0                  0      0           0    1
## Bic_brunnea                   1             0                  0      0           0    0
## Bic_jefferyi                  0             0                  0      0           1    0
## Bic_auricruda_fulgida         1             0                  0      0           0    0
## Bic_smithi_smithi             0             1                  0      0           0    0

Then, we can go ahead and use these two objects as input for make.simmap.

butterfly.maps<-make.simmap(butterfly.tree,x=X,Q=Q,
	nsim=100)
## make.simmap is sampling character histories conditioned on
## the transition matrix
## 
## Q =
##                         forest forest+fringe forest+fringe+open      fringe fringe+open
## forest             -0.03475871    0.03475871         0.00000000  0.00000000  0.00000000
## forest+fringe       0.11091079   -0.20858240         0.02027704  0.07739457  0.00000000
## forest+fringe+open  0.00000000    0.08808561        -0.08808561  0.00000000  0.00000000
## fringe              0.00000000    0.49520827         0.00000000 -1.15363566  0.65842739
## fringe+open         0.00000000    0.00000000         0.03335454  0.00000000 -0.09931852
## open                0.00000000    0.00000000         0.00000000  0.00000000  0.00000000
##                          open
## forest             0.00000000
## forest+fringe      0.00000000
## forest+fringe+open 0.00000000
## fringe             0.00000000
## fringe+open        0.06596398
## open               0.00000000
## (specified by the user);
## and (mean) root node prior probabilities
## pi =
##             forest      forest+fringe forest+fringe+open             fringe 
##          0.1666667          0.1666667          0.1666667          0.1666667 
##        fringe+open               open 
##          0.1666667          0.1666667
## Done.
butterfly.maps
## 100 phylogenetic trees with mapped discrete characters

(There’s actually another way to do this to. We can actually take the object index.matrix from our fitted model and then use this as our design matrix model in make.simmap! If we do it that way, we can also use MCMC to sample the transition rates from their posterior distribution by setting Q="mcmc".)

Let’s now plot a single stochastic map – and I’ll overlay posterior probabilities for all internal nodes using ape::nodelabels.

map.cols<-setNames(c(rgb(0,1,0),rgb(0,0.5,0.5),rgb(1/3,1/3,1/3),
	rgb(0,0,1),rgb(0.5,0.5,0),rgb(1,0,0)),levels(habitat))
plot(butterfly.maps[[1]],map.cols,ftype="off",
	ylim=c(0,1.05*Ntip(butterfly.tree)))
legend(x="topleft",legend=levels(habitat),pt.cex=1.2,pch=15,
	col=map.cols,cex=0.8,bty="n",ncol=2)
obj<-summary(butterfly.maps)
par(fg="transparent")
nodelabels(pie=obj$ace[1:butterfly.tree$Nnode,],piecol=map.cols,
	cex=0.3)

plot of chunk unnamed-chunk-9

par(fg="black")

Last of all, we can try something fun. Here, I’ll overlay, one by one, all the stochastic character mapped histories in our sample. This is a visualization to approximate what can be obtained using phytools::densityMap, but for a multi-state character. (Depending on what device you’re rendering to, this can be very slow!)

par(lend=1)
for(i in 1:length(butterfly.maps))
	plot(butterfly.maps[[i]],make.transparent(map.cols,
		1/length(butterfly.maps)),add=if(i==1) FALSE else TRUE,
		type="fan",ftype="off",lwd=3)
nodelabels(pie=obj$ace[1:butterfly.tree$Nnode,],piecol=map.cols,
	cex=0.4)
par(fg="black")
legend(x="topleft",legend=levels(habitat),pt.cex=1.2,pch=15,
	col=map.cols,cex=0.8,bty="n",ncol=2)