I responded to a long email from a phytools user that posed a couple of interesting questions about modeling discrete character evolution (and reconstructing ancestral states) on phylogenies.
Part 1 proposed the following scenario. All outgroup taxa contained a 'primitive' state for the character, while all ingroup taxa exhibited 1 of 6 derived character states; however the presumed evolutionary scenario is that evolution proceeded from the primitive to one of the 6 derived states (we just don't know which one) and then on to the others. We also presume that the primitive state cannot re-emerge from any of the derived states.
Our data might thus look something like this:
library(phytools)
plotTree(tree,type="fan",fsize=0.5,lwd=1)
x
## OUTGROUP1 OUTGROUP2 t7 t8 t57 t112 t113
## primitive primitive T5 T5 T5 T5 T5
## t58 t103 t104 t197 t198 t178 t179
## T3 T6 T3 T3 T3 T3 T3
## t13 t14 t1 t84 t85 t150 t151
## T3 T4 T4 T2 T2 T4 T4
## t156 t157 t63 t164 t165 t60 t144
## T4 T4 T2 T4 T4 T5 T6
## t145 t193 t194 t43 t195 t196 t160
## T6 T6 T6 T4 T4 T4 T4
## t167 t168 t47 t30 t31 t48 t49
## T4 T4 T3 T1 T4 T6 T4
## t80 t81 t154 t155 t4 t91 t92
## T2 T6 T4 T4 T3 T6 T6
## t75 t122 t123 t138 t139 t10 t38
## T6 T6 T6 T5 T5 T3 T1
## t39 t125 t126 t99 t100 t56 t101
## T3 T4 T4 T3 T4 T4 T4
## t102 t51 t32 t24 t6 t20 t22
## T4 T4 T4 T4 T2 T5 T4
## t106 t107 t33 t9 t65 t66 t158
## T4 T4 T5 T5 T5 T5 T2
## t159 t52 t53 t108 t109 t55 t25
## T2 T6 T5 T3 T3 T3 T3
## t16 t23 t176 t177 t59 t105 t114
## T4 T4 T4 T4 T4 T3 T3
## t115 t127 t189 t190 t27 t2 t41
## T3 T4 T4 T4 T6 T4 T4
## t42 t152 t153 t17 t5 t93 t94
## T4 T5 T5 T5 T5 T5 T5
## t44 t61 t69 t70 t148 t149 t79
## T5 T4 T4 T4 T6 T6 T3
## t180 t181 t163 t134 t135 t68 t34
## T6 T6 T6 T6 T6 T6 T5
## t191 t192 t67 t54 t3 t73 t74
## T5 T5 T5 T5 T2 T6 T6
## t185 t186 t90 t136 t137 t64 t50
## T3 T3 T3 T3 T3 T3 T5
## t116 t117 t37 t76 t199 t200 t173
## T5 T5 T5 T5 T5 T5 T5
## t128 t45 t161 t162 t35 t36 t18
## T5 T1 T5 T5 T4 T2 T4
## t118 t119 t124 t171 t172 t46 t146
## T4 T3 T3 T3 T3 T3 T3
## t147 t88 t89 t110 t111 t98 t174
## T3 T4 T4 T4 T4 T4 T4
## t175 t142 t143 t183 t184 t21 t140
## T4 T4 T4 T6 T6 T2 T2
## t141 t28 t82 t83 t96 t97 t62
## T2 T2 T2 T2 T2 T2 T3
## t129 t130 t133 t169 t170 t95 t15
## T2 T2 T2 T2 T2 T2 T4
## t86 t87 t77 t78 t40 t26 t71
## T6 T5 T6 T6 T2 T6 T2
## t72 t11 t12 t120 t121 t182 t187
## T2 T5 T5 T2 T5 T2 T2
## t188 t166 t29 t131 t132 t19
## T2 T2 T4 T5 T3 T5
## Levels: primitive T1 T2 T3 T4 T5 T6
The simplest way to model this is by permitting one rate of change between primitive and any derived state (but not back), and then a second rate of change between any derived state. Of course, there are many ways in which we could make this model more complex - for instance by imposing an order on character evolution, or permitting certain types of changes to be more likely than others. These would use a similar structure, but for simplicity's sake, I'll use the most basic model here:
model<-matrix(c(0,1,1,1,1,1,1,
0,0,2,2,2,2,2,
0,2,0,2,2,2,2,
0,2,2,0,2,2,2,
0,2,2,2,0,2,2,
0,2,2,2,2,0,2,
0,2,2,2,2,2,0),7,7,byrow=TRUE)
colnames(model)<-rownames(model)<-
levels(x)
model
## primitive T1 T2 T3 T4 T5 T6
## primitive 0 1 1 1 1 1 1
## T1 0 0 2 2 2 2 2
## T2 0 2 0 2 2 2 2
## T3 0 2 2 0 2 2 2
## T4 0 2 2 2 0 2 2
## T5 0 2 2 2 2 0 2
## T6 0 2 2 2 2 2 0
trees<-make.simmap(tree,x,model=model,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## primitive T1 T2 T3 T4
## primitive -0.5158406 0.08597343 0.08597343 0.08597343 0.08597343
## T1 0.0000000 -1.97405002 0.39481000 0.39481000 0.39481000
## T2 0.0000000 0.39481000 -1.97405002 0.39481000 0.39481000
## T3 0.0000000 0.39481000 0.39481000 -1.97405002 0.39481000
## T4 0.0000000 0.39481000 0.39481000 0.39481000 -1.97405002
## T5 0.0000000 0.39481000 0.39481000 0.39481000 0.39481000
## T6 0.0000000 0.39481000 0.39481000 0.39481000 0.39481000
## T5 T6
## primitive 0.08597343 0.08597343
## T1 0.39481000 0.39481000
## T2 0.39481000 0.39481000
## T3 0.39481000 0.39481000
## T4 0.39481000 0.39481000
## T5 -1.97405002 0.39481000
## T6 0.39481000 -1.97405002
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## primitive T1 T2 T3 T4 T5 T6
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
## Done.
trees
## 100 phylogenetic trees with mapped discrete characters
## compute posterior probabilities at nodes
obj1<-summary(trees)
colors<-if(phytools:::.check.pkg("RColorBrewer"))
setNames(brewer.pal(length(levels(x)),"Accent"),levels(x)) else
setNames(palette()[1:7],levels(x))
plot(obj1,cex=c(0.4,0.2),fsize=0.5,type="fan",colors=colors)
add.simmap.legend(colors=colors,prompt=FALSE,x=-1.2,y=-0.75,fsize=0.7)
## obtain posterior probabilities at the MRCA of ingroup
ingroup<-names(x)[as.character(x)%in%paste("T",1:6,sep="")]
obj1$ace[as.character(findMRCA(tree,ingroup)),]
## primitive T1 T2 T3 T4 T5 T6
## 0.07 0.02 0.02 0.01 0.25 0.63 0.00
This isn't quite enough, though, because this scenario has not set the node that is the MRCA of the ingroup to be in states 1 through 6. We can see this reflected in our posterior probabilities (above).
Unfortunately, make.simmap
does not yet let us explicitly
specify priors on internal nodes other than the global root. However,
we can work around this limitation by adding a zero length terminal
edge to any internal node of the tree & then specifying a prior
distribution on that node!
Here is what this looks like for our current case:
node<-findMRCA(tree,ingroup)
X<-to.matrix(x,levels(x))
X<-rbind(X,c(0,rep(1/6,6)))
rownames(X)[nrow(X)]<-node
tail(X)
## primitive T1 T2 T3 T4 T5 T6
## t166 0 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## t29 0 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000
## t131 0 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000
## t132 0 0.0000000 0.0000000 1.0000000 0.0000000 0.0000000 0.0000000
## t19 0 0.0000000 0.0000000 0.0000000 0.0000000 1.0000000 0.0000000
## 204 0 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667 0.1666667
tree<-bind.tip(tree,where=node,edge.length=0,tip.label=node)
trees<-make.simmap(tree,X,model=model,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## primitive T1 T2 T3 T4
## primitive -0.4879016 0.08131693 0.08131693 0.08131693 0.08131693
## T1 0.0000000 -1.97333863 0.39466773 0.39466773 0.39466773
## T2 0.0000000 0.39466773 -1.97333863 0.39466773 0.39466773
## T3 0.0000000 0.39466773 0.39466773 -1.97333863 0.39466773
## T4 0.0000000 0.39466773 0.39466773 0.39466773 -1.97333863
## T5 0.0000000 0.39466773 0.39466773 0.39466773 0.39466773
## T6 0.0000000 0.39466773 0.39466773 0.39466773 0.39466773
## T5 T6
## primitive 0.08131693 0.08131693
## T1 0.39466773 0.39466773
## T2 0.39466773 0.39466773
## T3 0.39466773 0.39466773
## T4 0.39466773 0.39466773
## T5 -1.97333863 0.39466773
## T6 0.39466773 -1.97333863
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## primitive T1 T2 T3 T4 T5 T6
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
## Done.
trees<-lapply(trees,drop.tip.simmap,as.character(node))
class(trees)<-c("multiSimmap","multiPhylo")
obj2<-summary(trees)
plot(obj2,cex=c(0.4,0.2),fsize=0.5,type="fan",colors=colors)
add.simmap.legend(colors=colors,prompt=FALSE,x=-1.2,y=-0.75,fsize=0.7)
obj2$ace[as.character(node),]
## primitive T1 T2 T3 T4 T5 T6
## 0.00 0.04 0.02 0.02 0.21 0.69 0.02
Drop the tip we added to our original tree:
tree<-drop.tip(tree,as.character(node))
Part 2 of the email asked how to model characters with multiple states. That is to say, each lineage may have state A
or state B
or both state A
and B
, for example.
The simplest way to do this is by using a model that is similar to those
used for modeling historical biogeography by assuming that transitions
from A
to B
must go through A+B
and so on.
Here is an imagined case with the states A
, B
,
and C
states<-c("A","A+B","A+C","A+B+C","B","B+C","C")
model<-matrix(NA,7,7,dimnames=list(states,states))
model[1,]<-c(0,1,1,0,0,0,0)
model[2,]<-c(1,0,0,1,0,0,0)
model[3,]<-c(1,0,0,1,0,0,0)
model[4,]<-c(0,1,1,0,0,1,0)
model[5,]<-c(0,1,0,0,0,1,0)
model[6,]<-c(0,0,0,1,1,0,1)
model[7,]<-c(0,0,1,0,0,1,0)
model
## A A+B A+C A+B+C B B+C C
## A 0 1 1 0 0 0 0
## A+B 1 0 0 1 0 0 0
## A+C 1 0 0 1 0 0 0
## A+B+C 0 1 1 0 0 1 0
## B 0 1 0 0 0 1 0
## B+C 0 0 0 1 1 0 1
## C 0 0 1 0 0 1 0
Note that our model permits only transitions in which states are added or subtracted: A
to A+B
to A+B+C
are permitted, but not A+B
to B+C
- which actually requires as least two changes. The model also assumes that all (permitted) state changes occur with the same rate; however we could easily relax this assumption if we were so inclined.
trees<-make.simmap(drop.tip(tree,c("OUTGROUP1","OUTGROUP2")),y,
model=model,nsim=100)
## Warning in log(comp[1:M + N]): NaNs produced
## Warning in log(comp[1:M + N]): NaNs produced
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## A A+B A+B+C A+C B B+C
## A -2.652463 1.326231 1.326231 0.000000 0.000000 0.000000
## A+B 1.326231 -2.652463 0.000000 1.326231 0.000000 0.000000
## A+B+C 1.326231 0.000000 -2.652463 1.326231 0.000000 0.000000
## A+C 0.000000 1.326231 1.326231 -3.978694 0.000000 1.326231
## B 0.000000 1.326231 0.000000 0.000000 -2.652463 1.326231
## B+C 0.000000 0.000000 0.000000 1.326231 1.326231 -3.978694
## C 0.000000 0.000000 1.326231 0.000000 0.000000 1.326231
## C
## A 0.000000
## A+B 0.000000
## A+B+C 0.000000
## A+C 0.000000
## B 0.000000
## B+C 1.326231
## C -2.652463
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## A A+B A+B+C A+C B B+C C
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
## Warning in rstate(p/sum(p)): Some probabilities (slightly?) < 0. Setting p
## < 0 to zero.
## Done.
trees
## 100 phylogenetic trees with mapped discrete characters
We can visualize this in the typical way:
obj3<-summary(trees)
plot(obj3,cex=c(0.4,0.2),fsize=0.5,type="fan",
colors=setNames(colors,levels(y)))
add.simmap.legend(colors=setNames(colors,levels(y)),prompt=FALSE,
x=-0.95,y=-0.65,fsize=0.7)
but we also merge mapped states A
, A+B
,
A+C
, & A+B+C
into A
; and
the other states into not-A
and visualize the results. For
instance:
treesA<-mergeMappedStates(trees,c("A","A+B","A+C","A+B+C"),"A")
treesA<-mergeMappedStates(treesA,c("B","B+C","C"),"not-A")
dmap<-densityMap(treesA,plot=FALSE,states=c("not-A","A"))
## sorry - this might take a while; please be patient
plot(dmap,type="fan",fsize=c(0.5,0.8),outline=T)
or:
class(treesA)<-c("multiSimmap",class(treesA)) ## hack
obj4<-summary(treesA)
plot(obj4,cex=c(0.4,0.2),fsize=0.5,type="fan",
colors=setNames(c("blue","red"),c("not-A","A")))
add.simmap.legend(colors=setNames(c("blue","red"),c("not-A","A"))[2:1],
prompt=FALSE,x=-0.95,y=-0.9,fsize=0.7)
The data for this demo were simulated as follows.
First demo 1:
Q<-0.4*matrix(c(-5,1,1,1,1,1,
1,-5,1,1,1,1,
1,1,-5,1,1,1,
1,1,1,-5,1,1,
1,1,1,1,-5,1,
1,1,1,1,1,-5),6,6)
rownames(Q)<-colnames(Q)<-paste("T",1:6,sep="")
x<-sim.history(tree<-pbtree(n=200,scale=0.9),Q)$states
tree<-midpoint.root(bind.tip(tree,where=Ntip(tree)+1,
tip.label="OUTGROUP1",edge.length=1.1))
tree<-bind.tip(tree,where=Ntip(tree)+1,tip.label="OUTGROUP2")
x<-c(rep("primitive",2),x)
names(x)[1:2]<-paste("OUTGROUP",1:2,sep="")
x<-as.factor(x)
Now demo 2:
states<-c("A","A+B","A+C","A+B+C","B","B+C","C")
Q<-matrix(NA,7,7,dimnames=list(states,states))
Q[1,]<-c(-2,1,1,0,0,0,0)
Q[2,]<-c(1,-2,0,1,0,0,0)
Q[3,]<-c(1,0,-2,1,0,0,0)
Q[4,]<-c(0,1,1,-3,0,1,0)
Q[5,]<-c(0,1,0,0,-2,1,0)
Q[6,]<-c(0,0,0,1,1,-3,1)
Q[7,]<-c(0,0,1,0,0,1,-2)
y<-as.factor(sim.history(tree,Q,anc="A")$states)
y<-y[-grep("OUT",names(y))]
That's it.
No comments:
Post a Comment
Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.