Sunday, October 1, 2017

Two different special cases of modeling discrete character evolution on trees

I responded to a long email from a phytools user that posed a couple of interesting questions about modeling discrete character evolution (and reconstructing ancestral states) on phylogenies.

Part 1 proposed the following scenario. All outgroup taxa contained a 'primitive' state for the character, while all ingroup taxa exhibited 1 of 6 derived character states; however the presumed evolutionary scenario is that evolution proceeded from the primitive to one of the 6 derived states (we just don't know which one) and then on to the others. We also presume that the primitive state cannot re-emerge from any of the derived states.

Our data might thus look something like this:

library(phytools)
plotTree(tree,type="fan",fsize=0.5,lwd=1)

plot of chunk unnamed-chunk-1

x
## OUTGROUP1 OUTGROUP2        t7        t8       t57      t112      t113 
## primitive primitive        T5        T5        T5        T5        T5 
##       t58      t103      t104      t197      t198      t178      t179 
##        T3        T6        T3        T3        T3        T3        T3 
##       t13       t14        t1       t84       t85      t150      t151 
##        T3        T4        T4        T2        T2        T4        T4 
##      t156      t157       t63      t164      t165       t60      t144 
##        T4        T4        T2        T4        T4        T5        T6 
##      t145      t193      t194       t43      t195      t196      t160 
##        T6        T6        T6        T4        T4        T4        T4 
##      t167      t168       t47       t30       t31       t48       t49 
##        T4        T4        T3        T1        T4        T6        T4 
##       t80       t81      t154      t155        t4       t91       t92 
##        T2        T6        T4        T4        T3        T6        T6 
##       t75      t122      t123      t138      t139       t10       t38 
##        T6        T6        T6        T5        T5        T3        T1 
##       t39      t125      t126       t99      t100       t56      t101 
##        T3        T4        T4        T3        T4        T4        T4 
##      t102       t51       t32       t24        t6       t20       t22 
##        T4        T4        T4        T4        T2        T5        T4 
##      t106      t107       t33        t9       t65       t66      t158 
##        T4        T4        T5        T5        T5        T5        T2 
##      t159       t52       t53      t108      t109       t55       t25 
##        T2        T6        T5        T3        T3        T3        T3 
##       t16       t23      t176      t177       t59      t105      t114 
##        T4        T4        T4        T4        T4        T3        T3 
##      t115      t127      t189      t190       t27        t2       t41 
##        T3        T4        T4        T4        T6        T4        T4 
##       t42      t152      t153       t17        t5       t93       t94 
##        T4        T5        T5        T5        T5        T5        T5 
##       t44       t61       t69       t70      t148      t149       t79 
##        T5        T4        T4        T4        T6        T6        T3 
##      t180      t181      t163      t134      t135       t68       t34 
##        T6        T6        T6        T6        T6        T6        T5 
##      t191      t192       t67       t54        t3       t73       t74 
##        T5        T5        T5        T5        T2        T6        T6 
##      t185      t186       t90      t136      t137       t64       t50 
##        T3        T3        T3        T3        T3        T3        T5 
##      t116      t117       t37       t76      t199      t200      t173 
##        T5        T5        T5        T5        T5        T5        T5 
##      t128       t45      t161      t162       t35       t36       t18 
##        T5        T1        T5        T5        T4        T2        T4 
##      t118      t119      t124      t171      t172       t46      t146 
##        T4        T3        T3        T3        T3        T3        T3 
##      t147       t88       t89      t110      t111       t98      t174 
##        T3        T4        T4        T4        T4        T4        T4 
##      t175      t142      t143      t183      t184       t21      t140 
##        T4        T4        T4        T6        T6        T2        T2 
##      t141       t28       t82       t83       t96       t97       t62 
##        T2        T2        T2        T2        T2        T2        T3 
##      t129      t130      t133      t169      t170       t95       t15 
##        T2        T2        T2        T2        T2        T2        T4 
##       t86       t87       t77       t78       t40       t26       t71 
##        T6        T5        T6        T6        T2        T6        T2 
##       t72       t11       t12      t120      t121      t182      t187 
##        T2        T5        T5        T2        T5        T2        T2 
##      t188      t166       t29      t131      t132       t19 
##        T2        T2        T4        T5        T3        T5 
## Levels: primitive T1 T2 T3 T4 T5 T6

The simplest way to model this is by permitting one rate of change between primitive and any derived state (but not back), and then a second rate of change between any derived state. Of course, there are many ways in which we could make this model more complex - for instance by imposing an order on character evolution, or permitting certain types of changes to be more likely than others. These would use a similar structure, but for simplicity's sake, I'll use the most basic model here:

model<-matrix(c(0,1,1,1,1,1,1,
    0,0,2,2,2,2,2,
    0,2,0,2,2,2,2,
    0,2,2,0,2,2,2,
    0,2,2,2,0,2,2,
    0,2,2,2,2,0,2,
    0,2,2,2,2,2,0),7,7,byrow=TRUE)
colnames(model)<-rownames(model)<-
    levels(x)
model
##           primitive T1 T2 T3 T4 T5 T6
## primitive         0  1  1  1  1  1  1
## T1                0  0  2  2  2  2  2
## T2                0  2  0  2  2  2  2
## T3                0  2  2  0  2  2  2
## T4                0  2  2  2  0  2  2
## T5                0  2  2  2  2  0  2
## T6                0  2  2  2  2  2  0
trees<-make.simmap(tree,x,model=model,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
## 
## Q =
##            primitive          T1          T2          T3          T4
## primitive -0.5158406  0.08597343  0.08597343  0.08597343  0.08597343
## T1         0.0000000 -1.97405002  0.39481000  0.39481000  0.39481000
## T2         0.0000000  0.39481000 -1.97405002  0.39481000  0.39481000
## T3         0.0000000  0.39481000  0.39481000 -1.97405002  0.39481000
## T4         0.0000000  0.39481000  0.39481000  0.39481000 -1.97405002
## T5         0.0000000  0.39481000  0.39481000  0.39481000  0.39481000
## T6         0.0000000  0.39481000  0.39481000  0.39481000  0.39481000
##                    T5          T6
## primitive  0.08597343  0.08597343
## T1         0.39481000  0.39481000
## T2         0.39481000  0.39481000
## T3         0.39481000  0.39481000
## T4         0.39481000  0.39481000
## T5        -1.97405002  0.39481000
## T6         0.39481000 -1.97405002
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## primitive        T1        T2        T3        T4        T5        T6 
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
## Done.
trees
## 100 phylogenetic trees with mapped discrete characters
## compute posterior probabilities at nodes
obj1<-summary(trees)
colors<-if(phytools:::.check.pkg("RColorBrewer"))
    setNames(brewer.pal(length(levels(x)),"Accent"),levels(x)) else 
    setNames(palette()[1:7],levels(x))
plot(obj1,cex=c(0.4,0.2),fsize=0.5,type="fan",colors=colors)
add.simmap.legend(colors=colors,prompt=FALSE,x=-1.2,y=-0.75,fsize=0.7)

plot of chunk unnamed-chunk-2

## obtain posterior probabilities at the MRCA of ingroup
ingroup<-names(x)[as.character(x)%in%paste("T",1:6,sep="")]
obj1$ace[as.character(findMRCA(tree,ingroup)),]
## primitive        T1        T2        T3        T4        T5        T6 
##      0.07      0.02      0.02      0.01      0.25      0.63      0.00

This isn't quite enough, though, because this scenario has not set the node that is the MRCA of the ingroup to be in states 1 through 6. We can see this reflected in our posterior probabilities (above).

Unfortunately, make.simmap does not yet let us explicitly specify priors on internal nodes other than the global root. However, we can work around this limitation by adding a zero length terminal edge to any internal node of the tree & then specifying a prior distribution on that node!

Here is what this looks like for our current case:

node<-findMRCA(tree,ingroup)
X<-to.matrix(x,levels(x))
X<-rbind(X,c(0,rep(1/6,6)))
rownames(X)[nrow(X)]<-node
tail(X)
##      primitive        T1        T2        T3        T4        T5        T6
## t166         0 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## t29          0 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
## t131         0 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000
## t132         0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000
## t19          0 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000
## 204          0 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
tree<-bind.tip(tree,where=node,edge.length=0,tip.label=node)
trees<-make.simmap(tree,X,model=model,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
## 
## Q =
##            primitive          T1          T2          T3          T4
## primitive -0.4879016  0.08131693  0.08131693  0.08131693  0.08131693
## T1         0.0000000 -1.97333863  0.39466773  0.39466773  0.39466773
## T2         0.0000000  0.39466773 -1.97333863  0.39466773  0.39466773
## T3         0.0000000  0.39466773  0.39466773 -1.97333863  0.39466773
## T4         0.0000000  0.39466773  0.39466773  0.39466773 -1.97333863
## T5         0.0000000  0.39466773  0.39466773  0.39466773  0.39466773
## T6         0.0000000  0.39466773  0.39466773  0.39466773  0.39466773
##                    T5          T6
## primitive  0.08131693  0.08131693
## T1         0.39466773  0.39466773
## T2         0.39466773  0.39466773
## T3         0.39466773  0.39466773
## T4         0.39466773  0.39466773
## T5        -1.97333863  0.39466773
## T6         0.39466773 -1.97333863
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## primitive        T1        T2        T3        T4        T5        T6 
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
## Done.
trees<-lapply(trees,drop.tip.simmap,as.character(node))
class(trees)<-c("multiSimmap","multiPhylo")
obj2<-summary(trees)
plot(obj2,cex=c(0.4,0.2),fsize=0.5,type="fan",colors=colors)
add.simmap.legend(colors=colors,prompt=FALSE,x=-1.2,y=-0.75,fsize=0.7)

plot of chunk unnamed-chunk-3

obj2$ace[as.character(node),]
## primitive        T1        T2        T3        T4        T5        T6 
##      0.00      0.04      0.02      0.02      0.21      0.69      0.02

Drop the tip we added to our original tree:

tree<-drop.tip(tree,as.character(node))

Part 2 of the email asked how to model characters with multiple states. That is to say, each lineage may have state A or state B or both state A and B, for example.

The simplest way to do this is by using a model that is similar to those used for modeling historical biogeography by assuming that transitions from A to B must go through A+B and so on.

Here is an imagined case with the states A, B, and C

states<-c("A","A+B","A+C","A+B+C","B","B+C","C")
model<-matrix(NA,7,7,dimnames=list(states,states))
model[1,]<-c(0,1,1,0,0,0,0)
model[2,]<-c(1,0,0,1,0,0,0)
model[3,]<-c(1,0,0,1,0,0,0)
model[4,]<-c(0,1,1,0,0,1,0)
model[5,]<-c(0,1,0,0,0,1,0)
model[6,]<-c(0,0,0,1,1,0,1)
model[7,]<-c(0,0,1,0,0,1,0)
model
##       A A+B A+C A+B+C B B+C C
## A     0   1   1     0 0   0 0
## A+B   1   0   0     1 0   0 0
## A+C   1   0   0     1 0   0 0
## A+B+C 0   1   1     0 0   1 0
## B     0   1   0     0 0   1 0
## B+C   0   0   0     1 1   0 1
## C     0   0   1     0 0   1 0

Note that our model permits only transitions in which states are added or subtracted: A to A+B to A+B+C are permitted, but not A+B to B+C - which actually requires as least two changes. The model also assumes that all (permitted) state changes occur with the same rate; however we could easily relax this assumption if we were so inclined.

trees<-make.simmap(drop.tip(tree,c("OUTGROUP1","OUTGROUP2")),y,
    model=model,nsim=100)
## Warning in log(comp[1:M + N]): NaNs produced

## Warning in log(comp[1:M + N]): NaNs produced
## make.simmap is sampling character histories conditioned on the transition matrix
## 
## Q =
##               A       A+B     A+B+C       A+C         B       B+C
## A     -2.652463  1.326231  1.326231  0.000000  0.000000  0.000000
## A+B    1.326231 -2.652463  0.000000  1.326231  0.000000  0.000000
## A+B+C  1.326231  0.000000 -2.652463  1.326231  0.000000  0.000000
## A+C    0.000000  1.326231  1.326231 -3.978694  0.000000  1.326231
## B      0.000000  1.326231  0.000000  0.000000 -2.652463  1.326231
## B+C    0.000000  0.000000  0.000000  1.326231  1.326231 -3.978694
## C      0.000000  0.000000  1.326231  0.000000  0.000000  1.326231
##               C
## A      0.000000
## A+B    0.000000
## A+B+C  0.000000
## A+C    0.000000
## B      0.000000
## B+C    1.326231
## C     -2.652463
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##         A       A+B     A+B+C       A+C         B       B+C         C 
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
## Warning in rstate(p/sum(p)): Some probabilities (slightly?) < 0. Setting p
## < 0 to zero.
## Done.
trees
## 100 phylogenetic trees with mapped discrete characters

We can visualize this in the typical way:

obj3<-summary(trees)
plot(obj3,cex=c(0.4,0.2),fsize=0.5,type="fan",
    colors=setNames(colors,levels(y)))
add.simmap.legend(colors=setNames(colors,levels(y)),prompt=FALSE,
    x=-0.95,y=-0.65,fsize=0.7)

plot of chunk unnamed-chunk-6

but we also merge mapped states A, A+B, A+C, & A+B+C into A; and the other states into not-A and visualize the results. For instance:

treesA<-mergeMappedStates(trees,c("A","A+B","A+C","A+B+C"),"A")
treesA<-mergeMappedStates(treesA,c("B","B+C","C"),"not-A")
dmap<-densityMap(treesA,plot=FALSE,states=c("not-A","A"))
## sorry - this might take a while; please be patient
plot(dmap,type="fan",fsize=c(0.5,0.8),outline=T)

plot of chunk unnamed-chunk-7

or:

class(treesA)<-c("multiSimmap",class(treesA)) ## hack
obj4<-summary(treesA)
plot(obj4,cex=c(0.4,0.2),fsize=0.5,type="fan",
    colors=setNames(c("blue","red"),c("not-A","A")))
add.simmap.legend(colors=setNames(c("blue","red"),c("not-A","A"))[2:1],
    prompt=FALSE,x=-0.95,y=-0.9,fsize=0.7)

plot of chunk unnamed-chunk-8

The data for this demo were simulated as follows.

First demo 1:

Q<-0.4*matrix(c(-5,1,1,1,1,1,
    1,-5,1,1,1,1,
    1,1,-5,1,1,1,
    1,1,1,-5,1,1,
    1,1,1,1,-5,1,
    1,1,1,1,1,-5),6,6)
rownames(Q)<-colnames(Q)<-paste("T",1:6,sep="")
x<-sim.history(tree<-pbtree(n=200,scale=0.9),Q)$states
tree<-midpoint.root(bind.tip(tree,where=Ntip(tree)+1,
    tip.label="OUTGROUP1",edge.length=1.1))
tree<-bind.tip(tree,where=Ntip(tree)+1,tip.label="OUTGROUP2")
x<-c(rep("primitive",2),x)
names(x)[1:2]<-paste("OUTGROUP",1:2,sep="")
x<-as.factor(x)

Now demo 2:

states<-c("A","A+B","A+C","A+B+C","B","B+C","C")
Q<-matrix(NA,7,7,dimnames=list(states,states))
Q[1,]<-c(-2,1,1,0,0,0,0)
Q[2,]<-c(1,-2,0,1,0,0,0)
Q[3,]<-c(1,0,-2,1,0,0,0)
Q[4,]<-c(0,1,1,-3,0,1,0)
Q[5,]<-c(0,1,0,0,-2,1,0)
Q[6,]<-c(0,0,0,1,1,-3,1)
Q[7,]<-c(0,0,1,0,0,1,-2)
y<-as.factor(sim.history(tree,Q,anc="A")$states)
y<-y[-grep("OUT",names(y))]

That's it.

No comments:

Post a Comment