## Sunday, October 1, 2017

### Two different special cases of modeling discrete character evolution on trees

I responded to a long email from a phytools user that posed a couple of interesting questions about modeling discrete character evolution (and reconstructing ancestral states) on phylogenies.

Part 1 proposed the following scenario. All outgroup taxa contained a 'primitive' state for the character, while all ingroup taxa exhibited 1 of 6 derived character states; however the presumed evolutionary scenario is that evolution proceeded from the primitive to one of the 6 derived states (we just don't know which one) and then on to the others. We also presume that the primitive state cannot re-emerge from any of the derived states.

Our data might thus look something like this:

``````library(phytools)
plotTree(tree,type="fan",fsize=0.5,lwd=1)
``````

``````x
``````
``````## OUTGROUP1 OUTGROUP2        t7        t8       t57      t112      t113
## primitive primitive        T5        T5        T5        T5        T5
##       t58      t103      t104      t197      t198      t178      t179
##        T3        T6        T3        T3        T3        T3        T3
##       t13       t14        t1       t84       t85      t150      t151
##        T3        T4        T4        T2        T2        T4        T4
##      t156      t157       t63      t164      t165       t60      t144
##        T4        T4        T2        T4        T4        T5        T6
##      t145      t193      t194       t43      t195      t196      t160
##        T6        T6        T6        T4        T4        T4        T4
##      t167      t168       t47       t30       t31       t48       t49
##        T4        T4        T3        T1        T4        T6        T4
##       t80       t81      t154      t155        t4       t91       t92
##        T2        T6        T4        T4        T3        T6        T6
##       t75      t122      t123      t138      t139       t10       t38
##        T6        T6        T6        T5        T5        T3        T1
##       t39      t125      t126       t99      t100       t56      t101
##        T3        T4        T4        T3        T4        T4        T4
##      t102       t51       t32       t24        t6       t20       t22
##        T4        T4        T4        T4        T2        T5        T4
##      t106      t107       t33        t9       t65       t66      t158
##        T4        T4        T5        T5        T5        T5        T2
##      t159       t52       t53      t108      t109       t55       t25
##        T2        T6        T5        T3        T3        T3        T3
##       t16       t23      t176      t177       t59      t105      t114
##        T4        T4        T4        T4        T4        T3        T3
##      t115      t127      t189      t190       t27        t2       t41
##        T3        T4        T4        T4        T6        T4        T4
##       t42      t152      t153       t17        t5       t93       t94
##        T4        T5        T5        T5        T5        T5        T5
##       t44       t61       t69       t70      t148      t149       t79
##        T5        T4        T4        T4        T6        T6        T3
##      t180      t181      t163      t134      t135       t68       t34
##        T6        T6        T6        T6        T6        T6        T5
##      t191      t192       t67       t54        t3       t73       t74
##        T5        T5        T5        T5        T2        T6        T6
##      t185      t186       t90      t136      t137       t64       t50
##        T3        T3        T3        T3        T3        T3        T5
##      t116      t117       t37       t76      t199      t200      t173
##        T5        T5        T5        T5        T5        T5        T5
##      t128       t45      t161      t162       t35       t36       t18
##        T5        T1        T5        T5        T4        T2        T4
##      t118      t119      t124      t171      t172       t46      t146
##        T4        T3        T3        T3        T3        T3        T3
##      t147       t88       t89      t110      t111       t98      t174
##        T3        T4        T4        T4        T4        T4        T4
##      t175      t142      t143      t183      t184       t21      t140
##        T4        T4        T4        T6        T6        T2        T2
##      t141       t28       t82       t83       t96       t97       t62
##        T2        T2        T2        T2        T2        T2        T3
##      t129      t130      t133      t169      t170       t95       t15
##        T2        T2        T2        T2        T2        T2        T4
##       t86       t87       t77       t78       t40       t26       t71
##        T6        T5        T6        T6        T2        T6        T2
##       t72       t11       t12      t120      t121      t182      t187
##        T2        T5        T5        T2        T5        T2        T2
##      t188      t166       t29      t131      t132       t19
##        T2        T2        T4        T5        T3        T5
## Levels: primitive T1 T2 T3 T4 T5 T6
``````

The simplest way to model this is by permitting one rate of change between primitive and any derived state (but not back), and then a second rate of change between any derived state. Of course, there are many ways in which we could make this model more complex - for instance by imposing an order on character evolution, or permitting certain types of changes to be more likely than others. These would use a similar structure, but for simplicity's sake, I'll use the most basic model here:

``````model<-matrix(c(0,1,1,1,1,1,1,
0,0,2,2,2,2,2,
0,2,0,2,2,2,2,
0,2,2,0,2,2,2,
0,2,2,2,0,2,2,
0,2,2,2,2,0,2,
0,2,2,2,2,2,0),7,7,byrow=TRUE)
colnames(model)<-rownames(model)<-
levels(x)
model
``````
``````##           primitive T1 T2 T3 T4 T5 T6
## primitive         0  1  1  1  1  1  1
## T1                0  0  2  2  2  2  2
## T2                0  2  0  2  2  2  2
## T3                0  2  2  0  2  2  2
## T4                0  2  2  2  0  2  2
## T5                0  2  2  2  2  0  2
## T6                0  2  2  2  2  2  0
``````
``````trees<-make.simmap(tree,x,model=model,nsim=100)
``````
``````## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
##            primitive          T1          T2          T3          T4
## primitive -0.5158406  0.08597343  0.08597343  0.08597343  0.08597343
## T1         0.0000000 -1.97405002  0.39481000  0.39481000  0.39481000
## T2         0.0000000  0.39481000 -1.97405002  0.39481000  0.39481000
## T3         0.0000000  0.39481000  0.39481000 -1.97405002  0.39481000
## T4         0.0000000  0.39481000  0.39481000  0.39481000 -1.97405002
## T5         0.0000000  0.39481000  0.39481000  0.39481000  0.39481000
## T6         0.0000000  0.39481000  0.39481000  0.39481000  0.39481000
##                    T5          T6
## primitive  0.08597343  0.08597343
## T1         0.39481000  0.39481000
## T2         0.39481000  0.39481000
## T3         0.39481000  0.39481000
## T4         0.39481000  0.39481000
## T5        -1.97405002  0.39481000
## T6         0.39481000 -1.97405002
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## primitive        T1        T2        T3        T4        T5        T6
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
``````
``````## Done.
``````
``````trees
``````
``````## 100 phylogenetic trees with mapped discrete characters
``````
``````## compute posterior probabilities at nodes
obj1<-summary(trees)
colors<-if(phytools:::.check.pkg("RColorBrewer"))
setNames(brewer.pal(length(levels(x)),"Accent"),levels(x)) else
setNames(palette()[1:7],levels(x))
plot(obj1,cex=c(0.4,0.2),fsize=0.5,type="fan",colors=colors)
``````

``````## obtain posterior probabilities at the MRCA of ingroup
ingroup<-names(x)[as.character(x)%in%paste("T",1:6,sep="")]
obj1\$ace[as.character(findMRCA(tree,ingroup)),]
``````
``````## primitive        T1        T2        T3        T4        T5        T6
##      0.07      0.02      0.02      0.01      0.25      0.63      0.00
``````

This isn't quite enough, though, because this scenario has not set the node that is the MRCA of the ingroup to be in states 1 through 6. We can see this reflected in our posterior probabilities (above).

Unfortunately, `make.simmap` does not yet let us explicitly specify priors on internal nodes other than the global root. However, we can work around this limitation by adding a zero length terminal edge to any internal node of the tree & then specifying a prior distribution on that node!

Here is what this looks like for our current case:

``````node<-findMRCA(tree,ingroup)
X<-to.matrix(x,levels(x))
X<-rbind(X,c(0,rep(1/6,6)))
rownames(X)[nrow(X)]<-node
tail(X)
``````
``````##      primitive        T1        T2        T3        T4        T5        T6
## t166         0 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## t29          0 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
## t131         0 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000
## t132         0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000
## t19          0 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000
## 204          0 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
``````
``````tree<-bind.tip(tree,where=node,edge.length=0,tip.label=node)
trees<-make.simmap(tree,X,model=model,nsim=100)
``````
``````## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
##            primitive          T1          T2          T3          T4
## primitive -0.4879016  0.08131693  0.08131693  0.08131693  0.08131693
## T1         0.0000000 -1.97333863  0.39466773  0.39466773  0.39466773
## T2         0.0000000  0.39466773 -1.97333863  0.39466773  0.39466773
## T3         0.0000000  0.39466773  0.39466773 -1.97333863  0.39466773
## T4         0.0000000  0.39466773  0.39466773  0.39466773 -1.97333863
## T5         0.0000000  0.39466773  0.39466773  0.39466773  0.39466773
## T6         0.0000000  0.39466773  0.39466773  0.39466773  0.39466773
##                    T5          T6
## primitive  0.08131693  0.08131693
## T1         0.39466773  0.39466773
## T2         0.39466773  0.39466773
## T3         0.39466773  0.39466773
## T4         0.39466773  0.39466773
## T5        -1.97333863  0.39466773
## T6         0.39466773 -1.97333863
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## primitive        T1        T2        T3        T4        T5        T6
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
``````
``````## Done.
``````
``````trees<-lapply(trees,drop.tip.simmap,as.character(node))
class(trees)<-c("multiSimmap","multiPhylo")
obj2<-summary(trees)
plot(obj2,cex=c(0.4,0.2),fsize=0.5,type="fan",colors=colors)
``````

``````obj2\$ace[as.character(node),]
``````
``````## primitive        T1        T2        T3        T4        T5        T6
##      0.00      0.04      0.02      0.02      0.21      0.69      0.02
``````

Drop the tip we added to our original tree:

``````tree<-drop.tip(tree,as.character(node))
``````

Part 2 of the email asked how to model characters with multiple states. That is to say, each lineage may have state `A` or state `B` or both state `A` and `B`, for example.

The simplest way to do this is by using a model that is similar to those used for modeling historical biogeography by assuming that transitions from `A` to `B` must go through `A+B` and so on.

Here is an imagined case with the states `A`, `B`, and `C`

``````states<-c("A","A+B","A+C","A+B+C","B","B+C","C")
model<-matrix(NA,7,7,dimnames=list(states,states))
model[1,]<-c(0,1,1,0,0,0,0)
model[2,]<-c(1,0,0,1,0,0,0)
model[3,]<-c(1,0,0,1,0,0,0)
model[4,]<-c(0,1,1,0,0,1,0)
model[5,]<-c(0,1,0,0,0,1,0)
model[6,]<-c(0,0,0,1,1,0,1)
model[7,]<-c(0,0,1,0,0,1,0)
model
``````
``````##       A A+B A+C A+B+C B B+C C
## A     0   1   1     0 0   0 0
## A+B   1   0   0     1 0   0 0
## A+C   1   0   0     1 0   0 0
## A+B+C 0   1   1     0 0   1 0
## B     0   1   0     0 0   1 0
## B+C   0   0   0     1 1   0 1
## C     0   0   1     0 0   1 0
``````

Note that our model permits only transitions in which states are added or subtracted: `A` to `A+B` to `A+B+C` are permitted, but not `A+B` to `B+C` - which actually requires as least two changes. The model also assumes that all (permitted) state changes occur with the same rate; however we could easily relax this assumption if we were so inclined.

``````trees<-make.simmap(drop.tip(tree,c("OUTGROUP1","OUTGROUP2")),y,
model=model,nsim=100)
``````
``````## Warning in log(comp[1:M + N]): NaNs produced

## Warning in log(comp[1:M + N]): NaNs produced
``````
``````## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
##               A       A+B     A+B+C       A+C         B       B+C
## A     -2.652463  1.326231  1.326231  0.000000  0.000000  0.000000
## A+B    1.326231 -2.652463  0.000000  1.326231  0.000000  0.000000
## A+B+C  1.326231  0.000000 -2.652463  1.326231  0.000000  0.000000
## A+C    0.000000  1.326231  1.326231 -3.978694  0.000000  1.326231
## B      0.000000  1.326231  0.000000  0.000000 -2.652463  1.326231
## B+C    0.000000  0.000000  0.000000  1.326231  1.326231 -3.978694
## C      0.000000  0.000000  1.326231  0.000000  0.000000  1.326231
##               C
## A      0.000000
## A+B    0.000000
## A+B+C  0.000000
## A+C    0.000000
## B      0.000000
## B+C    1.326231
## C     -2.652463
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##         A       A+B     A+B+C       A+C         B       B+C         C
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
``````
``````## Warning in rstate(p/sum(p)): Some probabilities (slightly?) < 0. Setting p
## < 0 to zero.
``````
``````## Done.
``````
``````trees
``````
``````## 100 phylogenetic trees with mapped discrete characters
``````

We can visualize this in the typical way:

``````obj3<-summary(trees)
plot(obj3,cex=c(0.4,0.2),fsize=0.5,type="fan",
colors=setNames(colors,levels(y)))
x=-0.95,y=-0.65,fsize=0.7)
``````

but we also merge mapped states `A`, `A+B`, `A+C`, & `A+B+C` into `A`; and the other states into `not-A` and visualize the results. For instance:

``````treesA<-mergeMappedStates(trees,c("A","A+B","A+C","A+B+C"),"A")
treesA<-mergeMappedStates(treesA,c("B","B+C","C"),"not-A")
dmap<-densityMap(treesA,plot=FALSE,states=c("not-A","A"))
``````
``````## sorry - this might take a while; please be patient
``````
``````plot(dmap,type="fan",fsize=c(0.5,0.8),outline=T)
``````

or:

``````class(treesA)<-c("multiSimmap",class(treesA)) ## hack
obj4<-summary(treesA)
plot(obj4,cex=c(0.4,0.2),fsize=0.5,type="fan",
colors=setNames(c("blue","red"),c("not-A","A")))
prompt=FALSE,x=-0.95,y=-0.9,fsize=0.7)
``````

The data for this demo were simulated as follows.

First demo 1:

``````Q<-0.4*matrix(c(-5,1,1,1,1,1,
1,-5,1,1,1,1,
1,1,-5,1,1,1,
1,1,1,-5,1,1,
1,1,1,1,-5,1,
1,1,1,1,1,-5),6,6)
rownames(Q)<-colnames(Q)<-paste("T",1:6,sep="")
x<-sim.history(tree<-pbtree(n=200,scale=0.9),Q)\$states
tree<-midpoint.root(bind.tip(tree,where=Ntip(tree)+1,
tip.label="OUTGROUP1",edge.length=1.1))
tree<-bind.tip(tree,where=Ntip(tree)+1,tip.label="OUTGROUP2")
x<-c(rep("primitive",2),x)
names(x)[1:2]<-paste("OUTGROUP",1:2,sep="")
x<-as.factor(x)
``````

Now demo 2:

``````states<-c("A","A+B","A+C","A+B+C","B","B+C","C")
Q<-matrix(NA,7,7,dimnames=list(states,states))
Q[1,]<-c(-2,1,1,0,0,0,0)
Q[2,]<-c(1,-2,0,1,0,0,0)
Q[3,]<-c(1,0,-2,1,0,0,0)
Q[4,]<-c(0,1,1,-3,0,1,0)
Q[5,]<-c(0,1,0,0,-2,1,0)
Q[6,]<-c(0,0,0,1,1,-3,1)
Q[7,]<-c(0,0,1,0,0,1,-2)
y<-as.factor(sim.history(tree,Q,anc="A")\$states)
y<-y[-grep("OUT",names(y))]
``````

That's it.