Tuesday, November 5, 2013

Reconstructed ancestral & tip states for a continuous character evolving by Brownian motion with missing data

I'm working on a short comment (perhaps for publication, perhaps not) on ancestral state reconstruction methods when data for some tips are missing or uncertain. The reason for this is twofold: (1) enough people have asked me about this that it seems some kind of literature reference might be handy; and (2) little to no modification of existing approaches is required to reconstruct ancestral states when some tip states are uncertain or unknown. For instance, this is a routine matter in the reconstruction of ancestral DNA sequences when some tip states are ambiguous.

Possibly the least interesting case for reconstructing ancestral states when some of the data for tips are unknown is ML estimation of ancestral values of a continuous character under constant-rate Brownian motion. This is (1) because ML ancestral states for the parent nodes of lineages leading to tip taxa with missing are exactly the same as the states we'd obtain by linearly interpolating on the basis of the states at subtending nodes, and (2) because ML states for the missing tips in the tree are (under BM which has an expected change of zero with variance σ2t along any edge of length t, remember) identical to the reconstructed state at the parent node.

Uninteresting as it might be in most cases, we can nonetheless do this. I just modified the phytools function anc.ML to attempt this if some tips in the tree are missing from the data vector, x. Here's a simple demo to show what I mean:

> library(phytools)
> source("anc.ML.R")
> tree<-pbtree(n=26,scale=1,tip.label=LETTERS)
> yy<-x<-fastBM(tree)
> x<-sample(x,13)
> X<-anc.ML(tree,x)
> setdiff(names(yy),names(x))
 [1] "B" "D" "F" "J" "K" "L" "M" "N" "R" "T" "U" "V" "X"
> phenogram(tree,c(x,X$missing.x,X$ace),spread.labels=TRUE, spread.cost=c(1,0))

If you look closely at the list of missing taxa returned by setdiff and the plotted traitgram you'll see (just as promised) the reconstructed phenotypes of missing tips are connected to their ancestors (interpolated perfectly along the subtending branch) by precisely horizontal lines.

That's it.


  1. Huh, I feel like I knew this back in my head, but it had never really come together as a complete thought that, oh, of course the reconstructed ancestor and the reconstructed descendant has to be identical under BM because of the expectation of no change. Neat!

    I guess I'll add that to my list of interesting examples where the maximum likelihood solution is almost certainly wrong. In paleontology, there's a nice one where if you assume sampling is a Poisson process like process (just like birth and death in a birth-death model), then the last time you observe a taxon in the fossil record is most likely when it went extinct. This is because the ML waiting time to extinct is zero, since waiting times are exponentially distributed. (The same situation exists for first appearances and true times of origination.) But this is almost certainly wrong in every potential case, real or simulated...

    1. Indeed, although in this case the MLEs are wrong but unbiased.

    2. Yeah, the paleo-stratigraphy case with origination and extinction is definitely worse!

  2. Using ancestral nodes to predict tips with unknown trait values is shown in this paper (via PIC).


    I can send over the code if anyone is interested

    1. Hello pearsy,
      I am interested in predicting tip traits based on ancestral nodes. Thing is, I am dealing with categorical data, not continuous. I don't know if your code will be applicable. But can I have a copy nonetheless?