Wednesday, February 1, 2012

New version of brownie.lite accounting for sampling error

I just posted a new version of the phytools function brownie.lite which does the analysis of O'Meara et al. (2006), but also accounts for sampling error in the estimation of species means (basically following Ives et al. 2007).

Because of the gradual and haphazard way that I had been adding features to brownie.lite, the previous version of the function was become excessively messy & cumbersome. To remedy that problem, and to add this new feature to the function, I decided to totally re-program it. To see this, compare the new version here to its most recent predecessor (here). I did pull some code snippets from the original version and from other functions, such as evolvcv.lite.

The way it works is straightforward. In the case of, say, a single rate of evolution on all the branches of the tree, we optimize the following function for the log-likelihood (expressed in terms of the ancestral state at the root node, the evolution rate σ2, and the cophenetic matrix containing the heights above the root for the MRCAs of all the terminal species in the tree):

lik.single<-function(theta,y,C,se){
  n<-length(y)
  sig<-theta[1]
  a<-theta[2]
  E<-diag(se^2)
  logL<-as.numeric(-t(y-a)%*%solve(sig*C+E)%*%(y-a)/2 -n*log(2*pi)/2 -determinant(sig*C+E)$modulus[1]/2)
  return(-logL)
}


If we don't have sampling error, or we must choose to ignore it, no problem - we just feed the function a vector of zeroes and we are back to optimizing the model of O'Meara et al. & all prior versions of brownie.lite.

To see how this work, let's first simulate some data under the model:

> set.seed(1) # set seed for repeatability
> Q<-matrix(c(-1,1,1,-1),2,2) # model for discrete character
> tree<-sim.history(pbtree(n=300,scale=1),Q,anc=1) # simulate rate history
> x<-sim.rates(tree,c(1,10)) # simulate with two rates
> se<-rchisq(n=300,df=1) # species-specific sampling errors
> xe<-rnorm(n=300,mean=x,sd=se) # simulate means estimated with error
> names(xe)<-names(se)<-names(x)


Now, let's fit the O'Meara et al. model under three scenarios: first, let's use the true (in this case) known species means; second, let's use the "estimated" means (i.e., the means simulated with error) and fit the model, but fail to account for sampling error; and, finally, third, let's fit the O'Meara et al. model but accounting for sampling error in the estimation of species means. (In the output below I've excised some details for brevity.)

> model1<-brownie.lite(tree,x)
> model2<-brownie.lite(tree,xe)
> model3<-brownie.lite(tree,xe,se=se)
> model1
$sig2.single
[1] 4.198556
...
$logL1
[1] -375.638
$k1
[1] 2
$sig2.multiple
         1          2
0.8732808 10.7110750
...
$logL.multiple
[1] -287.0903
$k2
[1] 3
$P.chisq
[1] 2.08773e-40
...
> model2
$sig2.single
[1] 36.49996
...
$logL1
[1] -700.0235
$k1
[1] 2
$sig2.multiple
       1        2
19.60967 64.56690
...
$logL.multiple
[1] -673.3162
$k2
[1] 3
$P.chisq
[1] 2.700823e-13
...
> model3
$sig2.single
[1] 3.927711
...
$logL1
[1] -516.407
$k1
[1] 2
$sig2.multiple
         1          2
0.9975472 10.5384272
...
$logL.multiple
[1] -480.9778
$k2
[1] 3
$P.chisq
[1] 3.837931e-17
...


The most salient features of this analysis are, in my opinion, as follows: (1) the estimated rates for both model 1 (true species means) & model 3 (estimated means, but accounting for uncertainty) are quite close to the generating values; (2) the estimated rates for model 2 (estimated species means, not accounting for error) are too high, although the general trend (σ22>>σ12) is still evident; and (3) the "lost information" inherent in substituting estimated species means for their true values results in a decrease in power - that is, the log-likelihood difference between the one and two-rate models is much smaller (and the P-value larger) in models 2 & 3 (although in this case, the effect is so strong and the tree so large that all models are highly significant).

I have added this to a new, non-static version of phytools (v0.1-63) which can be downloaded here and installed from source.

Finally, I realized in updating this function that the calculation of the P-value for brownie.lite(...,test="simulation") is wrong for previous versions of brownie.lite, so anyone using that option should definitely update phytools!

5 comments:

  1. hi Liam, once again thanks for putting this together, really seems like an important step forward in expanding the applicability of these models. One question: is there any chance of having brownie.lite also simulate OU models? This is (to me, at least) the only reason I'm still using the original Brownie software at this point (since with OUCH, as far as I understand, the regimes must be determined at nodes and not in a manner similar to SIMMAP, so they can't really be directly compared)... Thanks!

    ReplyDelete
  2. The short answer is yes, but I'm not sure when it will get done.

    ReplyDelete