Friday, July 1, 2011

Null distribution by simulation for brownie.lite()

I just posted a new version of brownie.lite() on my R-phylogenetics page (direct link to code here). Just to remind you, this function implements the non-censored evolutionary rates test of O'Meara et al. (2006). In this method we fit two different models by maximum likelihood - one in which a single evolutionary rate prevails throughout the tree, and another in which the evolutionary rate shifts one or multiple times between two or more rates on different branches of the tree. We can then compare the fit of these models by, for instance, comparing 2 × the difference in their log-likelihoods to a χ2 distribution.
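For concreteness, here is what that comparison looks like in R. The log-likelihoods below are made-up illustrative numbers, not output from brownie.lite(); with one extra rate parameter, the statistic is compared to a χ2 distribution with 1 degree of freedom:

```r
# hypothetical log-likelihoods, for illustration only
logL.single <- -42.7 # one-rate model
logL.multi <- -40.1  # two-rate model (one extra parameter)
LR <- 2 * (logL.multi - logL.single)        # likelihood-ratio statistic
P <- pchisq(LR, df = 1, lower.tail = FALSE) # X^2 P-value
```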

We could also compare the observed likelihood ratio to likelihood ratios obtained by simulating under the null hypothesis. (In fact, I believe that the authors suggested this in their original paper.) I have now implemented this test in brownie.lite(). I am also presently conducting a mini-experiment to compare the type I error rates of the two methods. Since the likelihood ratio is only asymptotically χ2-distributed for large n, we might expect the simulation test to be more conservative than the χ2 test. So far, for n=10 and n=20, I have not found much evidence that it is.
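In generic terms, the simulation test just asks where the observed likelihood ratio falls in the distribution of ratios computed on data simulated under the one-rate null. A minimal sketch - not brownie.lite()'s internals - in which the rchisq() draws merely stand in for likelihood ratios that would come from refitting both models to each null dataset, and obs.LR is a hypothetical observed value:

```r
set.seed(1)
null.LR <- rchisq(999, df = 1) # stand-ins for null-simulated likelihood ratios
obs.LR <- 5.2                  # hypothetical observed statistic
# P-value: proportion of null ratios at least as extreme as the observed one
P.sim <- (sum(null.LR >= obs.LR) + 1) / (length(null.LR) + 1)
```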

I'll also confess to a small fudge in the implementation of this method. Occasionally (i.e., about 1% of the time), brownie.lite() fails to converge during optimization. Rather than fix this, I avoid the problem when computing the P-value by simulation by simply discarding any simulated datasets that are troublesome to optimize. I don't see how this would introduce any bias, but I don't know for sure. In the near future, I will fix this problem in brownie.lite() (and hopefully clean up the function in general, as it is pretty messy).

If you'd like to participate in my mini-experiment, I encourage you to download the new version of brownie.lite() and try running the following code (ideally increasing nrep and varying nsp). Please post the results.

source("brownie.lite.R") # load new version of brownie.lite()
source("make.simmap.R") # load make.simmap() (from the previous post)
require(geiger) # for birthdeath.tree() & sim.char()
typeI.x2<-0; typeI.sim<-0
nrep<-100; nsp<-20
for(i in 1:nrep){
  # simulate tree
  tree<-drop.tip(birthdeath.tree(b=1,d=0,taxa.stop=(nsp+1)), as.character(nsp+1))
  # rescale tree to unit height
  tree$edge.length<-tree$edge.length/max(branching.times(tree))
  # simulate rate regime & map it onto the tree
  x<-sim.char(tree,model.matrix=list(matrix(c(-1,1,1,-1),2,2)), model="discrete")[,,1]
  mtree<-make.simmap(tree,x)
  # simulate data under the null (a single-rate Brownian process)
  y<-sim.char(tree,model.matrix=list(matrix(1,1,1)))[,,1]
  # fit the models & compute both the X^2 and simulation P-values
  res<-brownie.lite(mtree,y,test="simulation")
  # count the frequency of P<0.05
  typeI.x2<-typeI.x2+(res$P.chisq<0.05)/nrep
  typeI.sim<-typeI.sim+(res$P.sim<0.05)/nrep
  # something to look at while this is running
  # (should converge to 0.05)
  print(paste("X^2 type I error:",typeI.x2*nrep/i, "; Simulation type I error:",typeI.sim*nrep/i))
}


  1. Not sure if you wanted to see the log or just summaries, but I did a couple:

    Tried to do longer ones with more species but my poor laptop can't handle it... hope it helps!

  2. @Rafa - Cool, can you describe how those plots were generated?

  3. Sure, I just created two vectors before the loop:

    > TIx2=NULL; TIsim=NULL

    then inside the loop I filled the vectors with the same equation that was being printed during each run:

    > TIx2[i]=typeI.x2*nrep/i
    > TIsim[i]=typeI.sim*nrep/i
    > print(paste(i,": X^2 type I error:",typeI.x2*nrep/i, "; Simulation type I error:",typeI.sim*nrep/i))

    Then, the density plots:

    > plot(density(TIx2),xlab="Type I error",main="nrep = 500, nsp = 20",xlim=c(-0.01,0.15),col="blue",lwd=2)
    > lines(density(TIsim),col="red",lwd=2)
    > abline(v=0.05,lty=2)
    > legend("topright", legend=c("X^2","Simulation"), lty=c(1,1), col=c("blue","red"))

    I also sourced brownie.lite() from the link in this post, and make.simmap() from the previous post on your blog.

    Hope that's the actual statistic you were interested in. If there's any other particular simulation scenario I can help with, just let me know. Cheers!

  4. This comment has been removed by the author.

  5. Just realized that the printed values are running estimates, updated at each iteration to fold the current run into the previous ones, so they aren't independent values - which means my density plots are pretty meaningless! Ack. Sorry.

    In that case, the last printed values for those were:

    nrep<-100; nsp=20
    "100 : X^2 type I error: 0.1 ; Simulation type I error: 0.05"

    nrep<-500; nsp=20
    "500 : X^2 type I error: 0.062 ; Simulation type I error: 0.064"

    nrep<-100; nsp=50
    [1] "100 : X^2 type I error: 0.06 ; Simulation type I error: 0.05"

    Hope that's more helpful! Cheers!

  6. Hi Liam,

    Do you have a sense for how fast/slow the simulated null runs? Also, how many simulated points does one need to obtain a well-sampled simulated null?


