Monday, October 27, 2014

Optimizing tree transformations for multiple discrete characters

A recent R-sig-phylo subscriber asked:

“I have a tree and discrete data (number of gene copies, for many genes) and would like to use the fitDiscrete function in geiger, or something similar. However, I would like to estimate the parameters given all of the datasets, not just with the data for each gene. For instance, if I was using the "delta” model to vary rates across the tree, I would like this delta value to reflect some sort of summary value across all datasets. Does anyone have an idea as to how this could be accomplished or perhaps point me in the right direction?“

Brian O'Meara pointed out that it would be straighforward to write a likelihood function that just summed the log-likelihoods across characters for different tree transformations and then wrapped the function in an optimizer. The following illustrates this explicitly.

First load packages - for this we need ape, phytools (just for simulation here), and geiger (for our tree rescaling):

## load packages
library(ape)
library(phytools)
## Loading required package: maps
library(geiger)

Now let's simulate some data that would be analogous to what we'd load from file in the empirical case:

## simulate tree & data
Q<-matrix(c(-1,1,1,-1),2,2)
rownames(Q)<-colnames(Q)<-letters[1:2]
tree<-pbtree(n=100,scale=1)
X<-replicate(10,sim.history(rescale(tree,model="lambda",lambda=0.5),Q)$states)
## Done simulation(s).
## Done simulation(s).
## Done simulation(s).
## Done simulation(s).
## Done simulation(s).
## Done simulation(s).
## Done simulation(s).
## Done simulation(s).
## Done simulation(s).
## Done simulation(s).
## this is what our data look like:
X
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## t64  "a"  "a"  "a"  "a"  "b"  "a"  "b"  "a"  "a"  "a"  
## t97  "a"  "b"  "b"  "a"  "a"  "b"  "b"  "b"  "a"  "a"  
## t98  "b"  "a"  "a"  "b"  "b"  "b"  "b"  "b"  "b"  "a"  
## t43  "b"  "b"  "b"  "a"  "b"  "a"  "a"  "b"  "a"  "b"  
## t44  "b"  "a"  "a"  "b"  "b"  "b"  "a"  "a"  "a"  "b"  
## t23  "b"  "a"  "b"  "a"  "b"  "a"  "a"  "a"  "b"  "a"  
## t85  "b"  "a"  "b"  "b"  "a"  "a"  "b"  "b"  "b"  "b"  
## t86  "b"  "a"  "b"  "b"  "b"  "b"  "a"  "b"  "a"  "a"  
## t68  "a"  "a"  "b"  "a"  "a"  "b"  "a"  "b"  "a"  "a"  
## t83  "b"  "a"  "b"  "b"  "b"  "a"  "a"  "b"  "a"  "b"  
## t84  "a"  "a"  "b"  "a"  "b"  "a"  "b"  "a"  "b"  "b"  
## t2   "a"  "a"  "a"  "a"  "b"  "b"  "a"  "b"  "b"  "a"  
## t3   "b"  "a"  "a"  "a"  "b"  "a"  "a"  "a"  "a"  "a"  
## t91  "b"  "b"  "b"  "a"  "a"  "a"  "b"  "a"  "a"  "a"  
## t92  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "a"  "a"  "a"  
## t45  "b"  "a"  "b"  "a"  "a"  "b"  "a"  "a"  "a"  "b"  
## t46  "b"  "a"  "b"  "b"  "b"  "a"  "a"  "b"  "a"  "a"  
## t58  "b"  "b"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  
## t59  "a"  "a"  "b"  "b"  "b"  "b"  "a"  "b"  "b"  "b"  
## t21  "b"  "b"  "b"  "b"  "b"  "a"  "a"  "a"  "b"  "b"  
## t60  "b"  "a"  "a"  "a"  "b"  "b"  "b"  "a"  "b"  "b"  
## t89  "a"  "a"  "b"  "a"  "b"  "b"  "a"  "a"  "a"  "b"  
## t90  "b"  "b"  "a"  "b"  "a"  "a"  "a"  "a"  "b"  "b"  
## t36  "b"  "a"  "b"  "a"  "b"  "b"  "b"  "b"  "b"  "a"  
## t27  "b"  "b"  "b"  "b"  "b"  "a"  "b"  "a"  "a"  "b"  
## t28  "a"  "b"  "b"  "b"  "b"  "b"  "a"  "a"  "b"  "b"  
## t29  "b"  "b"  "b"  "a"  "a"  "b"  "a"  "a"  "b"  "b"  
## t37  "b"  "b"  "b"  "b"  "b"  "a"  "a"  "a"  "b"  "b"  
## t38  "b"  "a"  "a"  "a"  "a"  "b"  "a"  "a"  "a"  "b"  
## t16  "a"  "a"  "b"  "b"  "a"  "a"  "a"  "b"  "a"  "a"  
## t17  "a"  "a"  "a"  "b"  "a"  "b"  "b"  "a"  "a"  "b"  
## t15  "b"  "a"  "a"  "a"  "a"  "b"  "b"  "a"  "b"  "b"  
## t47  "b"  "a"  "b"  "a"  "b"  "b"  "b"  "a"  "a"  "b"  
## t61  "b"  "a"  "b"  "a"  "b"  "a"  "a"  "b"  "a"  "b"  
## t62  "b"  "a"  "b"  "b"  "a"  "a"  "a"  "a"  "a"  "b"  
## t10  "b"  "a"  "b"  "b"  "b"  "b"  "a"  "b"  "a"  "b"  
## t11  "b"  "a"  "a"  "b"  "b"  "a"  "b"  "b"  "a"  "b"  
## t4   "b"  "b"  "b"  "a"  "a"  "b"  "a"  "b"  "a"  "a"  
## t13  "b"  "a"  "a"  "a"  "b"  "a"  "a"  "a"  "b"  "b"  
## t31  "b"  "a"  "a"  "b"  "a"  "b"  "b"  "a"  "a"  "a"  
## t93  "b"  "a"  "a"  "b"  "b"  "b"  "b"  "a"  "b"  "b"  
## t94  "b"  "a"  "a"  "a"  "b"  "a"  "a"  "a"  "b"  "a"  
## t8   "b"  "a"  "a"  "b"  "b"  "b"  "a"  "a"  "a"  "b"  
## t5   "a"  "a"  "b"  "b"  "b"  "b"  "b"  "b"  "a"  "b"  
## t42  "b"  "b"  "b"  "b"  "b"  "a"  "a"  "a"  "a"  "a"  
## t76  "a"  "b"  "b"  "a"  "b"  "a"  "b"  "a"  "a"  "a"  
## t77  "a"  "b"  "b"  "b"  "a"  "b"  "b"  "b"  "a"  "b"  
## t63  "a"  "a"  "b"  "b"  "b"  "b"  "a"  "a"  "b"  "a"  
## t95  "b"  "b"  "b"  "b"  "a"  "b"  "a"  "a"  "a"  "a"  
## t96  "b"  "a"  "a"  "b"  "a"  "b"  "a"  "a"  "b"  "a"  
## t32  "b"  "b"  "b"  "a"  "a"  "b"  "a"  "b"  "b"  "a"  
## t33  "b"  "b"  "a"  "b"  "b"  "b"  "b"  "b"  "b"  "a"  
## t81  "b"  "a"  "a"  "b"  "b"  "b"  "a"  "a"  "a"  "b"  
## t82  "b"  "a"  "a"  "b"  "b"  "b"  "a"  "a"  "a"  "b"  
## t24  "b"  "b"  "a"  "a"  "b"  "b"  "a"  "b"  "a"  "b"  
## t14  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "a"  "b"  
## t25  "a"  "a"  "b"  "a"  "a"  "a"  "b"  "a"  "b"  "b"  
## t78  "a"  "a"  "b"  "b"  "a"  "a"  "b"  "b"  "b"  "b"  
## t79  "a"  "b"  "b"  "a"  "a"  "a"  "b"  "a"  "b"  "b"  
## t19  "b"  "b"  "a"  "b"  "b"  "a"  "a"  "b"  "a"  "a"  
## t50  "a"  "b"  "a"  "a"  "b"  "b"  "a"  "a"  "a"  "b"  
## t51  "b"  "a"  "b"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  
## t26  "a"  "b"  "a"  "b"  "b"  "b"  "a"  "a"  "a"  "a"  
## t66  "a"  "a"  "b"  "a"  "b"  "a"  "a"  "b"  "a"  "a"  
## t67  "a"  "a"  "b"  "a"  "a"  "a"  "b"  "a"  "a"  "b"  
## t54  "a"  "b"  "b"  "a"  "b"  "a"  "b"  "a"  "a"  "a"  
## t55  "b"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "b"  
## t52  "a"  "a"  "a"  "a"  "b"  "b"  "b"  "b"  "b"  "b"  
## t53  "a"  "a"  "a"  "a"  "a"  "b"  "a"  "b"  "b"  "b"  
## t7   "b"  "a"  "b"  "a"  "b"  "a"  "a"  "a"  "b"  "b"  
## t18  "b"  "a"  "a"  "b"  "a"  "b"  "a"  "a"  "b"  "b"  
## t20  "b"  "b"  "b"  "b"  "a"  "b"  "b"  "b"  "a"  "a"  
## t69  "b"  "a"  "a"  "b"  "a"  "b"  "a"  "a"  "b"  "b"  
## t87  "b"  "b"  "b"  "b"  "b"  "b"  "a"  "b"  "b"  "b"  
## t88  "b"  "b"  "b"  "b"  "b"  "a"  "b"  "b"  "b"  "b"  
## t65  "b"  "b"  "b"  "b"  "a"  "b"  "b"  "b"  "a"  "a"  
## t74  "b"  "b"  "b"  "a"  "b"  "a"  "a"  "b"  "b"  "a"  
## t75  "a"  "a"  "a"  "a"  "a"  "b"  "a"  "b"  "a"  "b"  
## t39  "b"  "a"  "a"  "a"  "b"  "b"  "b"  "a"  "a"  "b"  
## t22  "a"  "a"  "a"  "a"  "a"  "b"  "b"  "a"  "a"  "b"  
## t34  "b"  "b"  "b"  "a"  "b"  "b"  "a"  "b"  "a"  "a"  
## t35  "a"  "a"  "a"  "b"  "b"  "a"  "b"  "a"  "a"  "b"  
## t70  "b"  "a"  "a"  "a"  "b"  "b"  "b"  "b"  "b"  "a"  
## t71  "a"  "a"  "b"  "a"  "b"  "a"  "a"  "a"  "a"  "b"  
## t99  "a"  "b"  "b"  "a"  "b"  "a"  "a"  "b"  "b"  "b"  
## t100 "b"  "a"  "a"  "a"  "b"  "a"  "a"  "b"  "a"  "b"  
## t80  "b"  "a"  "a"  "a"  "b"  "a"  "a"  "b"  "a"  "b"  
## t6   "a"  "a"  "a"  "a"  "a"  "b"  "a"  "b"  "a"  "a"  
## t12  "b"  "a"  "a"  "a"  "a"  "b"  "a"  "a"  "b"  "b"  
## t30  "b"  "a"  "a"  "b"  "a"  "a"  "a"  "b"  "b"  "b"  
## t56  "b"  "b"  "a"  "a"  "b"  "b"  "a"  "a"  "b"  "b"  
## t57  "b"  "a"  "a"  "b"  "a"  "b"  "a"  "b"  "b"  "a"  
## t40  "a"  "b"  "b"  "a"  "a"  "b"  "b"  "b"  "a"  "b"  
## t41  "b"  "a"  "b"  "a"  "b"  "b"  "a"  "b"  "b"  "a"  
## t1   "a"  "b"  "b"  "a"  "b"  "a"  "a"  "b"  "b"  "b"  
## t9   "a"  "b"  "b"  "a"  "a"  "b"  "a"  "b"  "b"  "b"  
## t48  "b"  "a"  "a"  "a"  "b"  "b"  "a"  "b"  "b"  "b"  
## t49  "b"  "a"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  
## t72  "b"  "a"  "a"  "a"  "a"  "b"  "a"  "b"  "b"  "b"  
## t73  "a"  "a"  "a"  "b"  "b"  "b"  "a"  "b"  "a"  "b"

Note that it is arbitrary, and unnecessary, that I have simulated using the same transition matrix Q for each data column here.

Finally, our tree-transformation fitting function. Here, I've used ace from ape internally to compute the likelihood of each character for each tree transformation. One could instead use fitDiscrete (which is slower, but perhaps more robust) or function in the phangorn package.

## here is our model fitting function
fitTransform<-function(tree,X,model,interval){
    lk<-function(par,tree,X,model){
        sum(apply(X,2,function(x,tree,model,par) 
            logLik(ace(x,rescale(tree,model,par),type="discrete")),
            tree=tree,model=model,par=par))
    }
    fit<-optimize(lk,interval,tree=tree,X=X,model=model,maximum=TRUE)
    return(list(par=fit$maximum,model=model,logLik=fit$objective))
}
obj<-fitTransform(tree,X,model="lambda",interval=c(0,1))
## here is our fitted model
obj
## $par
## [1] 0.472
## 
## $model
## [1] "lambda"
## 
## $logLik
## [1] -666.7

We could use the same function with different values of model and (necessarily, if so) interval. I use model="lambda" merely to illustrate the case.

That's it.

3 comments:

  1. Hell Liam,

    I'm the one who asked the question...figured this would be an appropriate place to continue the conversation, if you don't mind.

    What if I am specifically interested in the delta value or the 'a' value (in the EB model) and not just the logLik? I modified the above code to use fitDiscrete and the transform option "delta", then ran it.

    fitTransformGeiger<-function(tree,X,model,interval, transform){
    lk<-function(par,tree,X,model){
    sum(apply(X,2,function(x,tree,model,par) {
    x <- factor(x)
    levels(x) <- c(1,2,3, levels(x))
    logLik(fitDiscrete(tree, x, model, transform=transform))},
    tree=tree,model=model,par=par))
    }
    fit<-optimize(lk,interval,tree=tree,X=X,model=model,maximum=TRUE)
    return(list(par=fit$maximum,model=model,logLik=fit$objective))
    }

    fitTransformGeiger(sppTree2, PreZ, model="ER", transform="delta", interval=c(0,1))

    And had the par, logLik and model returned to me. But what about those other values? Am I perhaps unknowingly asking a question that does not have a simple answer?

    --Ben

    ReplyDelete
    Replies
    1. Hi Ben.

      The tree transformation only has a single parameter. When we fit an EB model to continuous trait data we have three parameters - the tree transformation parameter, the rate of evolution, and the state at the root node. For discrete characters this is not the case any longer. In that case, we have the branch-length transformation parameter & (for each character separately, in your case) an instantaneous transition matrix, Q. To obtain these after your optimization is completed, you can just transform your tree by the jointly estimated delta, and then you can iterate across the columns of your character matrix and each time estimate a transition matrix, Q, under your chosen substitution model.

      Let me know if this makes sense.

      All the best, Liam

      Delete
    2. Hello Liam,

      After digesting all this information for a day, yes, it now makes sense.

      Thanks for the help,
      Ben

      Delete