A phytools user reported a strange result from ancestral character reconstruction. I don't have a fix, per se, but it was interesting to diagnose and so I decided to (with his permission) share it here.
Firstly, let's take a look at our data. This is a real empirical dataset, but the names of the taxa have been anonymized to conceal any pre-publication results of this research:
library(phytools)
tree
##
## Phylogenetic tree with 271 tips and 270 internal nodes.
##
## Tip labels:
## t1, t2, t3, t4, t5, t6, ...
##
## Rooted; includes branch lengths.
y
## t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11 t12 t13 t14 t15
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## t16 t17 t18 t19 t20 t21 t22 t23 t24 t25 t26 t27 t28 t29 t30
## 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0
## t31 t32 t33 t34 t35 t36 t37 t38 t39 t40 t41 t42 t43 t44 t45
## 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## t46 t47 t48 t49 t50 t51 t52 t53 t54 t55 t56 t57 t58 t59 t60
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## t61 t62 t63 t64 t65 t66 t67 t68 t69 t70 t71 t72 t73 t74 t75
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## t76 t77 t78 t79 t80 t81 t82 t83 t84 t85 t86 t87 t88 t89 t90
## 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0
## t91 t92 t93 t94 t95 t96 t97 t98 t99 t100 t101 t102 t103 t104 t105
## 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
## t106 t107 t108 t109 t110 t111 t112 t113 t114 t115 t116 t117 t118 t119 t120
## 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0
## t121 t122 t123 t124 t125 t126 t127 t128 t129 t130 t131 t132 t133 t134 t135
## 1 0 0 1 0 0 0 1 1 1 1 0 0 0 0
## t136 t137 t138 t139 t140 t141 t142 t143 t144 t145 t146 t147 t148 t149 t150
## 0 1 1 0 0 1 0 1 1 0 1 1 1 1 1
## t151 t152 t153 t154 t155 t156 t157 t158 t159 t160 t161 t162 t163 t164 t165
## 1 1 0 1 1 0 1 1 1 0 1 1 0 1 0
## t166 t167 t168 t169 t170 t171 t172 t173 t174 t175 t176 t177 t178 t179 t180
## 1 1 1 0 0 0 1 0 1 0 1 0 0 0 0
## t181 t182 t183 t184 t185 t186 t187 t188 t189 t190 t191 t192 t193 t194 t195
## 0 0 0 1 1 0 0 0 1 0 0 0 1 1 0
## t196 t197 t198 t199 t200 t201 t202 t203 t204 t205 t206 t207 t208 t209 t210
## 0 0 0 0 0 1 1 0 1 1 0 0 0 1 1
## t211 t212 t213 t214 t215 t216 t217 t218 t219 t220 t221 t222 t223 t224 t225
## 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0
## t226 t227 t228 t229 t230 t231 t232 t233 t234 t235 t236 t237 t238 t239 t240
## 0 0 1 1 1 0 0 1 0 1 1 0 1 1 1
## t241 t242 t243 t244 t245 t246 t247 t248 t249 t250 t251 t252 t253 t254 t255
## 1 1 1 0 1 1 0 0 0 0 1 0 0 0 0
## t256 t257 t258 t259 t260 t261 t262 t263 t264 t265 t266 t267 t268 t269 t270
## 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
## t271
## 0
## Levels: 0 1
We can plot these on the tree just as easily:
plotTree(tree,type="fan",lwd=1,ftype="off")
tiplabels(pie=to.matrix(y[tree$tip.label],c(0,1)),
piecol=c("blue","red"),cex=0.3)
add.simmap.legend(colors=setNames(c("blue","red"),
c(0,1)),x=-176,y=169,prompt=FALSE,shape="circle")
and we see a character pattern that seems consistent with relatively conservative evolution of the trait.
However, when we estimate ancestral states using ace
a
very peculiar result emerges - basically all the internal nodes have
marginal likelihoods of 0.5
of being in each of the two
states:
plotTree(tree,type="fan",lwd=1,ftype="off")
tiplabels(pie=to.matrix(y[tree$tip.label],c(0,1)),
piecol=c("blue","red"),cex=0.3)
add.simmap.legend(colors=setNames(c("blue","red"),
c(0,1)),x=-176,y=169,prompt=FALSE,shape="circle")
fit.ace<-ace(y,tree,type="discrete")
nodelabels(pie=fit.ace$lik.anc,piecol=c("blue",
"red"),cex=0.4)
We also find the same result using stochastic mapping:
mtrees<-make.simmap(tree,y,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## 0 1
## 0 -1.100016 1.100016
## 1 1.100016 -1.100016
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## 0 1
## 0.5 0.5
## Done.
plot(mtrees[[1]],colors=setNames(c("blue","red"),c(0,1)),
ftype="off",type="fan")
It might be hard to tell, but there are (quite literally) thousands of changes between states mapped on this tree! When we use our stochastic maps to reconstruct posterior probabilities at nodes, the results converge on what we saw before:
obj<-summary(mtrees)
obj
## 100 trees with a mapped discrete character with states:
## 0, 1
##
## trees have 14970.19 changes between states on average
##
## changes are of the following types:
## 0,1 1,0
## x->y 7453.55 7516.64
##
## mean total time spent in each state is:
## 0 1 total
## raw 6834.9706198 6772.2170663 13607.19
## prop 0.5023059 0.4976941 1.00
plot(obj,type="fan",ftype="off",lwd=1,cex=c(0.4,0.3),
colors=setNames(c("blue","red"),c(0,1)))
So, what's going on?
Well, it turns out we can show that the fitted rate of transition between
states obtained by ace
(or by fitMk
, which is
used by make.simmap
internally) is too high.
fit.ace
##
## Ancestral Character Estimation
##
## Call: ace(x = y, phy = tree, type = "discrete")
##
## Log-likelihood: -187.1497
##
## Rate index matrix:
## 0 1
## 0 . 1
## 1 1 .
##
## Parameter estimates:
## rate index estimate std-err
## 1 1.1 25.9547
##
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
## 0 1
## 0.5 0.5
fit.phytools<-fitMk(tree,y)
fit.phytools
## Object of class "fitMk".
##
## Fitted (or set) value of Q:
## 0 1
## 0 -1.100016 1.100016
## 1 1.100016 -1.100016
##
## Fitted (or set) value of pi:
## 0 1
## 0.5 0.5
##
## Log-likelihood: -187.842892
We'll do this by visualizing the likelihood surface from a number just above zero, to a value about 10% larger than the estimated rate, as follows:
q<-seq(0.001,1.2,by=0.001)
logL<-sapply(q,function(q,tree,x)
logLik(fitMk(tree,x,fixedQ=matrix(c(-q,q,q,-q),2,2))),
tree=tree,x=y)
plot(q,logL,type="l",xlab="q(0,1) & q(1,0)",ylab="log(L)")
abline(v=fit.phytools$rates,lty="dashed")
It's pretty plain to see that the likelihood is not maximized at q0,1 = 1.1, but rather at a much smaller value:
ii<-which(logL==max(logL))
q[ii]
## [1] 0.005
Furthermore, the reason optimization goes awry is because, after the initial peak, the likelihood monotically increases for increasing values of the transition rate, q.
Now, if we force the transition rate to this better value, we obtain much more reasonable results. For instance:
q<-q[ii]
Q=matrix(c(-q,q,q,-q),2,2)
rownames(Q)<-colnames(Q)<-c(0,1)
mtrees<-make.simmap(tree,y,nsim=100,Q=Q)
## make.simmap is sampling character histories conditioned on the transition matrix
##
## Q =
## 0 1
## 0 -0.005 0.005
## 1 0.005 -0.005
## (specified by the user);
## and (mean) root node prior probabilities
## pi =
## 0 1
## 0.5 0.5
## Done.
plot(mtrees[[1]],colors=setNames(c("blue","red"),c(0,1)),
ftype="off",type="fan")
or:
obj<-summary(mtrees)
obj
## 100 trees with a mapped discrete character with states:
## 0, 1
##
## trees have 68.06 changes between states on average
##
## changes are of the following types:
## 0,1 1,0
## x->y 46.35 21.71
##
## mean total time spent in each state is:
## 0 1 total
## raw 1.091646e+04 2690.7285272 13607.19
## prop 8.022568e-01 0.1977432 1.00
plot(obj,type="fan",ftype="off",lwd=1,cex=c(0.4,0.3),
colors=setNames(c("blue","red"),c(0,1)))
In this case, fitDiscrete
in the geiger package does converge
on the right solution:
library(geiger)
fitDiscrete(tree,y)
## GEIGER-fitted comparative model of discrete data
## fitted Q matrix:
## 0 1
## 0 -0.004976043 0.004976043
## 1 0.004976043 -0.004976043
##
## model summary:
## log-likelihood = -155.337575
## AIC = 312.675150
## AICc = 312.690020
## free parameters = 1
##
## Convergence diagnostics:
## optimization iterations = 100
## failed iterations = 0
## frequency of best fit = 1.00
##
## object summary:
## 'lik' -- likelihood function
## 'bnd' -- bounds for likelihood search
## 'res' -- optimization iteration summary
## 'opt' -- maximum likelihood parameter estimates
Finally, we also get the right ancestral states if we just rescale the tree. This is because then the estimated transition rates are much higher, but the likelihood & posterior probabilities at nodes remain unchanged. For instance:
RESCALE<-function(tree,h=1){
tree$edge.length<-tree$edge.length/max(nodeHeights(tree))*h
tree
}
fit.rescale<-ace(y,RESCALE(tree),type="discrete")
fit.rescale
##
## Ancestral Character Estimation
##
## Call: ace(x = y, phy = RESCALE(tree), type = "discrete")
##
## Log-likelihood: -155.2032
##
## Rate index matrix:
## 0 1
## 0 . 1
## 1 1 .
##
## Parameter estimates:
## rate index estimate std-err
## 1 0.8832 0.1055
##
## Scaled likelihoods at the root (type '...$lik.anc' to get them for all nodes):
## 0 1
## 0.93206197 0.06793803
plotTree(tree,type="fan",lwd=1,ftype="off")
tiplabels(pie=to.matrix(y[tree$tip.label],c(0,1)),
piecol=c("blue","red"),cex=0.3)
add.simmap.legend(colors=setNames(c("blue","red"),
c(0,1)),x=-176,y=169,prompt=FALSE,shape="circle")
nodelabels(pie=fit.rescale$lik.anc,piecol=c("blue",
"red"),cex=0.4)
Now to figure out what can be done to ensure that fitMk
finds
the right solution in the first place!
That third tree is hilarious. So, should we be concerned that similar errors have gone undetected, or is this situation a one-off? Good example of the importance of visualizing data (and likelihoods) early and often.
ReplyDeleteI agree.
DeleteYes, we should be worried - but I have also seen cases where the likelihood surface seems to just monotonically increases with q, producing similarly bizarre reconstructions, but with no peak at a lower rate as in this case (and thus no obvious failure of optimization). Klaus Schliep & I have discussed writing a paper about such cases. We will see if that happens.