Thursday, October 20, 2016

Estimation of missing tip states under Brownian motion using likelihood

As I have noted previously, it is possible (and relatively straightforward) to reconstruct missing tip values under Brownian motion by jointly maximizing the likelihood of internal and (missing) terminal states using likelihood.

This is implemented in the phytools function anc.ML.

One attribute of these reconstructed values is that they will precisely match the reconstructed ancestral node state of their immediate parent. This is totally unsurprising because under Brownian motion because the expected change over any time period is zero (with variance equal to the product of the Brownian rate, σ2, and the edge length).

We can visualize this phenomenon using both phenogram and contMap. Here's what I mean.

First, simulate a tree & some data:

library(phytools)
tree<-pbtree(n=26,tip.label=LETTERS,scale=1)
x<-fastBM(tree)

Subsample the data so there is missing information for some tips:

xp<-sample(x,20)

Fit our ancestral states & missing terminals using ML:

fit<-anc.ML(tree,xp)
fit
## Ancestral character estimates using anc.ML under a BM model:
##        27        28        29        30        31        32        33 
## -0.851047 -0.824691 -0.884692 -1.184654 -1.128078 -1.402134 -1.580146 
##        34        35        36        37        38        39        40 
## -1.642514 -1.551363 -1.477286 -1.374941 -0.878525 -1.194390  0.306645 
##        41        42        43        44        45        46        47 
##  0.450288  0.539937  0.568322  0.527984 -0.806273 -0.753233 -0.514332 
##        48        49        50        51 
## -0.251404 -1.112156 -0.722011 -1.412652 
## 
## Fitted model parameters & likelihood:
##      sig2 log-likelihood
##  0.468265       5.497266
## 
## R thinks it has found the ML solution.

Note that reconstructed missing tip values match their parents (to fairly high numerical precision - this is obtained via numerical optimization, remember):

tips<-sapply(names(fit$missing.x),function(x,y)
    which(y==x),y=tree$tip.label)
library(phangorn)
parents<-sapply(tips,Ancestors,x=tree,type="parent")
X<-rbind(fit$missing.x,fit$ace[as.character(parents)])
rownames(X)<-c("daughter","parent")
X
##                  H         I         K         L         O          W
## daughter -1.580114 -1.402163 -1.194513 0.3066923 0.5279829 -0.7220155
## parent   -1.580146 -1.402134 -1.194390 0.3066445 0.5279840 -0.7220109

Now let's visualize it - first with phenogram:

x.all<-c(xp,fit$missing.x,fit$ace)
tree<-paintBranches(tree,tips,"2")
phenogram(tree,x.all)
nodelabels(node=tips,pie=rep(1,length(fit$missing.x)),
    cex=0.5)

plot of chunk unnamed-chunk-5

or with contMap:

obj<-contMap(tree,x.all,method="user",anc.states=fit$ace,
    plot=FALSE)
obj<-setMap(obj,colors=c("white","black"))
plot(obj)
nulo<-sapply(names(fit$missing.x),add.arrow,tree=obj,
    arrl=0.04,hedl=0.02,col="red")

plot of chunk unnamed-chunk-6

Finally, let's repeat the whole thing just to show it wasn't a fluke:

tree<-pbtree(n=26,tip.label=LETTERS,scale=1)
x<-fastBM(tree)
xp<-sample(x,20)
fit<-anc.ML(tree,xp)
tips<-sapply(names(fit$missing.x),function(x,y)
    which(y==x),y=tree$tip.label)
parents<-sapply(tips,Ancestors,x=tree,type="parent")
X<-rbind(fit$missing.x,fit$ace[as.character(parents)])
rownames(X)<-c("daughter","parent")
X
##                 B         L         N        R         V         Y
## daughter 1.430460 0.9306579 0.9480089 1.449290 0.2423973 0.4115517
## parent   1.430375 0.9306499 0.9480131 1.449279 0.2425937 0.4116535
x.all<-c(xp,fit$missing.x,fit$ace)
tree<-paintBranches(tree,tips,"2")
phenogram(tree,x.all)
nodelabels(node=tips,pie=rep(1,length(fit$missing.x)),
    cex=0.5)

plot of chunk unnamed-chunk-7

obj<-contMap(tree,x.all,method="user",anc.states=fit$ace,
    plot=FALSE)
obj<-setMap(obj,colors=c("white","black"))
plot(obj)
nulo<-sapply(names(fit$missing.x),add.arrow,tree=obj,
    arrl=0.04,hedl=0.02,col="red")

plot of chunk unnamed-chunk-8

Recently, it was reported to me that for some empirical dataset this has not been observed to be the case. The most likely explanation for this is non- convergence of the ML optimization by anc.ML. This can be ameliorated by increasing maxit, but this may not guarantee convergence on relatively large trees. I'm sure that a better solution exists, unfortunately I have not worked on this in a bit.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.