Friday, September 21, 2012

New version of fitDiversityModel with vastly improved ML optimization & correct covariance matrix for parameter estimates

I just posted a new version of the function fitDiversityModel. This function does the main part of the analysis of Mahler et al. (2010), but with some modifications. Most importantly, Mahler et al. use REML; whereas this function does ML. (In addition, in the paper there were some issues in the way in which we computed ecological opportunity at internal nodes. This has been fixed in the form of the phytools function estDiversity - although this is not directly relevant to fitDiversityModel, which can theoretically fit our opportunity model with ecological lineage densities from any source.)

This function is designed to fit a model of character evolution for a single trait in which the rate of Brownian evolution changes as a function of the lineage density** at internal nodes of the tree (**or any other feature that we can track for nodes on the tree). Consequently, the model has two main parameters of interest: σ02, the starting rate of evolution; and what we called "ψ", the rate of gain or loss in the BM rate per unit of change in ecological opportunity. If ψ < 0, this would indicate that the rate of evolution declined as lineages were added within an ecological community.

In using the previous version of fitDiversityModel for some new analyses, Luke Mahler discovered that fitDiversityModel was sensitive to the scaling of the tree. Theoretically, this should not be the case. For instance, if we scale the tree by a factor of 2, then we should expect σ02 and ψ to half. In fact, we can see this to be true (the data here are PC1 from the Mahler et al. paper, I believe).

> library(phytools)
> library(geiger)
> res1<-fitDiversityModel(rescaleTree(tree,1),x,d=density)
> res1
$logL
[1] -7.782838
$sig0
[1] 0.237636
$psi
[1] -0.003973024
$vcv
            sig0          psi
sig0  0.001622423 -0.002161093
psi  -0.002161093  0.009471374
$convergence
[1] TRUE
> res2<-fitDiversityModel(rescaleTree(tree,2),x,d=density)
> res2
$logL
[1] -7.782838
$sig0
[1] 0.118824
$psi
[1] -0.001986653
$vcv
            sig0          psi
sig0  0.000405508 -0.001080177
psi  -0.001080177  0.009469173
$convergence
[1] TRUE


[A quick sidenote on why we expect our likelihoods to be constant for rescaled trees - but not for rescaled data, for instance. The expected multivariate distribution of our tip data on the tree is always a function of the branch lengths of the tree × parameters in our evolutionary model. Thus, by scaling branch lengths up & evolutionary parameters down (or vice versa) we always end up the same multivariate probability density.]

OK, so far so good. But what about a more dramatic rescaling? If we scaled our tree by 100 instead of 2, we should just expect our parameter estimates to change by 1/100 instead of 1/2, right? Unfortunately, Luke discovered that as we rescale the branch lengths up it becomes increasingly difficult for optim, the base package multivariate optimizer, to converge on the correct solution. For instance:

> res100<-fitDiversityModel(rescaleTree(tree,100),x,d=density)
> res100
$logL
[1] -13.69297
$sig0
[1] 0.001646267
$psi
[1] 0
$vcv
             sig0          psi
sig0 -2.298157e-09 5.825806e-05
psi   5.825806e-05 3.639800e-01
$convergence
[1] FALSE


Initially, I tried to solve this by tweaking the control of optim, but then I realized that we had analytic solutions for σ02 and (the other parameter in the model, the ancestral state at the root node) a conditioned on ψ - so the problem could pretty easily be reformulated as a univariate, rather than multivariate, optimization problem. This could pretty much immediately take care of optimization issues, because the univariate optimizer (optimize) is much more reliable and quicker than optim.

So, I have updated fitDiversityModel to perform univariate optimization of ψ & use conditional MLEs for σ02 and a, and it works great! The code for this function is here, but I would generally advise installing the new version of phytools from source (here). This is because phytools imports a function, hessian, from the 'numDeriv' package, via the NAMESPACE, and this will generally not be available to fitDiversityModel unless it is loaded as part of the package.

[Note that although you can detach phytools to install and load the new version, I find that it is generally better to start a new R session here.]

> install.packages("phytools_0.1-96.tar.gz",type="source", repos=NULL)
> library(phytools)
> library(geiger)
> res1<-fitDiversityModel(rescaleTree(tree,1),x,d=density)
> res1
$logL
[1] -7.782838
$sig0
[1] 0.2376324
$psi
[1] -0.003972944
$vcv
             sig0           psi
sig0  0.0016200023 -3.757170e-05
psi  -0.0000375717  1.027241e-06
> res100<-fitDiversityModel(rescaleTree(tree,100),x,d=density)
> res100
$logL
[1] -7.782838
$sig0
[1] 0.002376324
$psi
[1] -3.972944e-05
$vcv
             sig0           psi
sig0  1.620002e-07 -3.757170e-09
psi  -3.757170e-09  1.027241e-10


Now the function is behaving a lot better - users should also notice that it is markedly faster (about 3 × faster if showTree is turned off).

A second thing the user might notice, though, is that the covariance matrix of the parameter estimates from the Hessian has changed as well. What happened? Well, it turns out this was a real bug with fitDiversityModel in its prior iterations. [I'm not characterizing the other problem as a real bug because fitDiversityModel warned us that it had not converged!] The problem was that internally fitDiversityModel was rescaling ψ for better optimization; and although ψ was backtransformed to its original scale for output, the variance of ψ was not! This is a pretty significant oversight, but it should've only affected people using $vcv to compute, say, a 95% confidence interval on model parameters (likelihood-ratio tests and AIC would not have been affected).

I also made one final change to the function - also affecting how ψ is rescaled internally. Basically, this will only affect your results if the range of values in d was very small, but the effect of d was large. We discovered this problem when we used "time since the root" as our d vector. In this case if the total tree length was short (e.g., 1.0 instead of, say, 100) parameter estimates could be off.

3 comments:

  1. Great post Liam. This appears to work seamlessly now.

    A quick note, in case anyone's interested: Liam mentioned using "time since the root" as our "d" vector. The reason to do this is to fit the "Time Model" from our 2010 Evolution paper. Creating a "time since the root" vector for a particular tree is very easy using the ape function "branching.times()". Here's how:

    time.vector <- max(branching.times(tree))-branching.times(tree)

    Then type the following to fit the "Time Model":

    time.model.results <- fitDiversityModel(tree,x,d=time.vector)

    And that should do it!

    ReplyDelete
  2. Hello,

    Thank you for your very useful explanations! But I have one short question. In the paper by L. Mahler & al. they also tested the model with a single evolutionary rate. I’m asking me if it is “enough” to write it like that:
    Constant.time.results <- fitDiversityModel(tree,x,d=NULL)

    I know the help page said “if d=NULL (the default) function will treat the diversification as if it occurred in a single geographic area”. But does it not return it to the same?


    Thank you!!!

    ReplyDelete
    Replies
    1. I think that this should answer your question. Good luck! Liam

      Delete