Saturday, September 23, 2017

AIC method for "ratebytree" object class

I just pushed a few small fixes plus a new S3 AIC method for "ratebytree" class objects.

This is what the function looks like:

AIC.ratebytree<-function(object,...,k=2){
    aic<-data.frame(AIC=c(k*object$common.rate.model$k-2*object$common.rate.model$logL,
        k*object$multi.rate.model$k-2*object$multi.rate.model$logL),
        df=c(object$common.rate.model$k,object$multi.rate.model$k))
    addtl.obj<-list(...)
    if(length(addtl.obj)>0){
        for(i in 1:length(addtl.obj)) aic<-rbind(aic,
            c(k*addtl.obj[[i]]$multi.rate.model$k-2*addtl.obj[[i]]$multi.rate.model$logL,
            addtl.obj[[i]]$multi.rate.model$k))
        rownames(aic)<-c("common-rate",paste("multi-rate:",1:(length(addtl.obj)+1),sep=""))
    } else rownames(aic)<-c("common-rate","multi-rate")
    aic
}

I'm not totally sure this is legal, but what the method does if multiple objects are supplied is that it assumes the common-rate model is the same for all supplied objects, and then it computes AIC values for only the multiple rate (or process) model for all subsequent objects.

Here's a demo using the same tree & data from earlier today:

library(phytools)
## our data:
par(mfrow=c(1,3))
nulo<-mapply(phenogram,tree=trees,x=x,MoreArgs=list(ylim=range(x),
    ftype="off"))

plot of chunk unnamed-chunk-2

## fit two-rate model
fit.reg<-ratebytree(trees,x,regimes=c(1,2,1))
fit.reg
## ML common-rate model:
##          s^2     a[1]    a[2]    a[3]    k   logL
## value    1.2117  0.7489  0.058   1.0751  4   -223.603    
## SE       0.1399  0.8292  1.1014  1.2046  
## 
## ML multi-rate model:
##          s^2[1]  s^2[2]  a[1]    a[2]    a[3]    k   logL
## value    0.8862  2.1068  0.7489  0.058   1.0751  5   -217.4606   
## SE       0.1195  0.4711  0.7091  1.4522  1.0302  
## 
## Likelihood ratio: 12.285 
## P-value (based on X^2): 5e-04 
## 
## R thinks it has found the ML solution.
## fit all rates different model
fit.all<-ratebytree(trees,x)
fit.all
## ML common-rate model:
##          s^2     a[1]    a[2]    a[3]    k   logL
## value    1.2117  0.7489  0.058   1.0751  4   -223.603    
## SE       0.1399  0.8292  1.1014  1.2046  
## 
## ML multi-rate model:
##          s^2[1]  s^2[2]  s^2[3]  a[1]    a[2]    a[3]    k   logL
## value    0.6709  2.1068  1.0656  0.7489  0.058   1.0751  6   -216.0332   
## SE       0.1342  0.4711  0.1946  0.617   1.4523  1.1297  
## 
## Likelihood ratio: 15.1397 
## P-value (based on X^2): 5e-04 
## 
## R thinks it has found the ML solution.
## now compute AIC scores:
AIC(fit.reg,fit.all)
##                   AIC df
## common-rate  455.2061  4
## multi-rate:1 444.9211  5
## multi-rate:2 444.0664  6

In this case, the all-rates-different model has the lowest AIC score, but the difference in fit between that & the two-rate model is quite small. Neat.

ratebytree for comparing rate (and process) of evolution between trees, now with user control of regimes

As I've mentioned in prior posts (1, 2, 3, 4, 5, 6), I'm currently in the midst of revising a manuscript describing the phytools function ratebytree.

In its original incarnation, this function was designed for the relatively simple task of comparing the Brownian rate of evolution between trees. More specifically, to fit two models: one in which each tree was permitted to have a different rate; and a second in which all trees had the same rate. In this form it was a very simple modification of O'Meara et al.'s (2006) censored model.

The editor & reviewers requested a number of new things, I have been diligently working to add these (summarized here). The final of these was to permit the user to specify different process regimes for different subsets of the trees in our full set. For instance, if we have 4 trees, perhaps we'd like to fit a model in which the first three have one rate, and the fourth a second.

Although this is genuinely no more complex than what I was doing already, at first I was befuddled as to how to do this without writing a completely new function to do it. Then it occurred to me that a multi-rate model in which some sets of trees have the same rate (and other sets, a different rate or different rates), is really no different from fitting a series of common-rate models and then summing the likelihood. (In practice, what I actually did was write a multi-rate function that computes the likelihood by summing multiple calls to the existing common rate function.)

I just pushed this update to GitHub.

Here I'll demonstrate it:

library(phytools)
packageVersion("phytools")
## [1] '0.6.28'
trees
## 3 phylogenetic trees

First, I'll fit a set of Brownian models in which trees 1, 2, & 3 all have different rates for trait x; and a model in which 1 & 3 have the same rate:

## our data:
par(mfrow=c(1,3))
nulo<-mapply(phenogram,tree=trees,x=x,MoreArgs=list(ylim=range(x),
    ftype="off"))

plot of chunk unnamed-chunk-2

## fit models
fit.all<-ratebytree(trees,x)
fit.all
## ML common-rate model:
##          s^2     a[1]    a[2]    a[3]    k   logL
## value    1.2117  0.7489  0.058   1.0751  4   -223.603    
## SE       0.1399  0.8292  1.1014  1.2046  
## 
## ML multi-rate model:
##          s^2[1]  s^2[2]  s^2[3]  a[1]    a[2]    a[3]    k   logL
## value    0.6709  2.1068  1.0656  0.7489  0.058   1.0751  6   -216.0332   
## SE       0.1342  0.4711  0.1946  0.617   1.4523  1.1297  
## 
## Likelihood ratio: 15.1397 
## P-value (based on X^2): 5e-04 
## 
## R thinks it has found the ML solution.
fit.reg<-ratebytree(trees,x,regimes=c(1,2,1))
fit.reg
## ML common-rate model:
##          s^2     a[1]    a[2]    a[3]    k   logL
## value    1.2117  0.7489  0.058   1.0751  4   -223.603    
## SE       0.1399  0.8292  1.1014  1.2046  
## 
## ML multi-rate model:
##          s^2[1]  s^2[2]  a[1]    a[2]    a[3]    k   logL
## value    0.8862  2.1068  0.7489  0.058   1.0751  5   -217.4606   
## SE       0.1195  0.4711  0.7091  1.4522  1.0302  
## 
## Likelihood ratio: 12.285 
## P-value (based on X^2): 5e-04 
## 
## R thinks it has found the ML solution.

Note that I have one fewer rate in the “multi-rate” model, but not one fewer ancestral state (a[1], a[2], and so one).

Next, I'll do the same thing - but using an OU model & character trait y as follows. Note that I can specify regime names arbitrarily. Before I just used 1 and 2, but I can equally well use A and B as follows. Since the ancestral states correspond to trees, not regimes, they remain numbered 1, 2, and so on.

## our data:
par(mfrow=c(1,3))
nulo<-mapply(phenogram,tree=trees,x=y,MoreArgs=list(ylim=range(y),
    ftype="off"))

plot of chunk unnamed-chunk-3

## fit models
fit.all<-ratebytree(trees,y,model="OU")
fit.all
## ML common-regime OU model:
##          s^2     a[1]    a[2]    a[3]    alpha   k   logL
## value    0.7577  1.9358  2.0289  1.4108  0.4931  5   -148.1179   
## SE       0.1458  0.23    0.3051  0.2546  0.1635  
## 
## ML multi-regime OU model:
##          s^2[1]  s^2[2]  s^2[3]  a[1]    a[2]    a[3]    alp[1]  alp[2]  alp[3]  k   logL
## value    0.5385  0.8383  1.1763  1.9196  1.7908  1.4349  1.1558  -0.016  1.7167  9   -126.9808   
## SE       0.1952  0.1875  0.4359  0.0919  0.9161  0.0939  0.4753  0       0.6696  
## 
## Likelihood ratio: 42.2742 
## P-value (based on X^2): 0 
## 
## R thinks it has found the ML solution.
fit.reg<-ratebytree(trees,y,model="OU",regimes=c("A","B","A"))
fit.reg
## ML common-regime OU model:
##          s^2     a[1]    a[2]    a[3]    alpha   k   logL
## value    0.7577  1.9359  2.0289  1.4109  0.4931  5   -148.1179   
## SE       0.1458  0.23    0.3051  0.2546  0.1635  
## 
## ML multi-regime OU model:
##          s^2[A]  s^2[B]  a[1]    a[2]    a[3]    alp[A]  alp[B]  k   logL
## value    0.8731  0.8383  1.9154  1.7908  1.4312  1.4913  -0.0046 7   -128.6895   
## SE       0.23    0.1875  0.0957  0.9161  0.0906  0.4256  0   
## 
## Likelihood ratio: 38.8568 
## P-value (based on X^2): 0 
## 
## R thinks it has found the ML solution.

The data were simulated as follows for this example:

trees<-c(pbtree(n=50),pbtree(n=40),pbtree(n=60))
x<-list(fastBM(trees[[1]],sig2=1),
    fastBM(trees[[2]],sig2=2),
    fastBM(trees[[3]],sig2=1))
y<-list(fastBM(trees[[1]],model="OU",alpha=2),
    fastBM(trees[[2]]),
    fastBM(trees[[3]],model="OU",alpha=2))

I haven't implemented this for discrete characters yet, but this will have to be my next project!

Thursday, September 21, 2017

Simulating trees with specific values of Pybus & Harvey's γ

A colleague recently asked me the following question titled:

“Simulating trees with different pybus & Harvey's γ.”

“This may be a question with a very obvious answer, but is there a way of simulating Yule trees so they are either early burst or late burst (i.e. can you create simulated trees with a high or low γ)? Trying to satisfy a reviewer’s request to test our taxonomy paper against null models.”

Well, we can't precisely simulate constant-rate pure-birth trees that have a particular γ value; however, one thing we could do is transform the edge lengths of the tree so that has a particular, predictable value of γ. In fact, it turns out to be the case - at least so far as I can tell (though I don't have a formal proof of this) that for any tree & particular value of γ, there is a corresponding “EB” (early-burst or ACDC, see Blomberg et al. 2003) edge-length transformation that will produce that very-same γ. It's just a matter of finding it!

Turns out - that's pretty easy. We can use numerical optimization, and since we only need to optimize in one-dimension, as the EB transformation only has one paramter, r, optimization is both fast & robust.

For instance:

library(phytools)
## simulate a tree
tree<-pbtree(n=100)
ltt(tree,show.tree=TRUE,log.lineages=F,log="y")

plot of chunk unnamed-chunk-1

## Object of class "ltt" containing:
## 
## (1) A phylogenetic tree with 100 tips and 99 internal nodes.
## 
## (2) Vectors containing the number of lineages (ltt) and branching times (times) on the tree.
## 
## (3) A value for Pybus & Harvey's "gamma" statistic of 2.0956, p-value = 0.0361.
## transform to gamma=0
ebTree<-phytools:::ebTree
gamma<-function(r,tree,g) 
    (g-ltt(ebTree(tree,r),plot=FALSE)$gamma)^2
fit<-optimize(gamma,c(-10,10),tree=tree,g=0,tol=1e-12)
ltt(ebTree(tree,fit$minimum),show.tree=TRUE,log.lineages=F,log="y",
    lwd=2)

plot of chunk unnamed-chunk-1

## Object of class "ltt" containing:
## 
## (1) A phylogenetic tree with 100 tips and 99 internal nodes.
## 
## (2) Vectors containing the number of lineages (ltt) and branching times (times) on the tree.
## 
## (3) A value for Pybus & Harvey's "gamma" statistic of 0, p-value = 1.
## or how about gamma=10
fit<-optimize(gamma,c(-10,10),tree=tree,g=10,tol=1e-12)
ltt(ebTree(tree,fit$minimum),show.tree=TRUE,log.lineages=F,log="y",
    lwd=2)

plot of chunk unnamed-chunk-1

## Object of class "ltt" containing:
## 
## (1) A phylogenetic tree with 100 tips and 99 internal nodes.
## 
## (2) Vectors containing the number of lineages (ltt) and branching times (times) on the tree.
## 
## (3) A value for Pybus & Harvey's "gamma" statistic of 10, p-value = 0.
## or gamma=-10
fit<-optimize(gamma,c(-10,10),tree=tree,g=-10,tol=1e-12)
ltt(ebTree(tree,fit$minimum),show.tree=TRUE,log.lineages=F,log="y",
    lwd=2)

plot of chunk unnamed-chunk-1

## Object of class "ltt" containing:
## 
## (1) A phylogenetic tree with 100 tips and 99 internal nodes.
## 
## (2) Vectors containing the number of lineages (ltt) and branching times (times) on the tree.
## 
## (3) A value for Pybus & Harvey's "gamma" statistic of -10, p-value = 0.

Note that the ebTree transformation, though it does produce the correct γ, has the side-effect of changing the total depth of the tree. This needn't be the case. For instance, we can just do the following:

## custom EB function to hold total length constant
EB<-function(tree,r){
    d<-max(nodeHeights(tree))
    tree<-ebTree(tree,r)
    tree$edge.length<-tree$edge.length/max(nodeHeights(tree))*d
    tree
}
gamma<-function(r,tree,g) 
    (g-ltt(EB(tree,r),plot=FALSE)$gamma)^2
## try with gamma=-4
fit<-optimize(gamma,c(-10,10),tree=tree,g=-4,tol=1e-12)
ltt(EB(tree,fit$minimum),show.tree=TRUE,log.lineages=F,log="y",
    lwd=2)

plot of chunk unnamed-chunk-2

## Object of class "ltt" containing:
## 
## (1) A phylogenetic tree with 100 tips and 99 internal nodes.
## 
## (2) Vectors containing the number of lineages (ltt) and branching times (times) on the tree.
## 
## (3) A value for Pybus & Harvey's "gamma" statistic of -4, p-value = 1e-04.

OK, now let's generate 100 trees with total length of 1.0 and γ=-1:

trees<-pbtree(n=100,scale=1,nsim=100)
foo<-function(tree,g)
    optimize(gamma,c(-10,10),tree=tree,g=g,tol=1e-12)$minimum
r<-sapply(trees,foo,g=-1)
ttrees<-mapply(EB,tree=trees,r=r,SIMPLIFY=FALSE)
class(ttrees)<-"multiPhylo"
g<-sapply(ttrees,function(x) ltt(x,plot=FALSE)$gamma)
g
##   [1] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
##  [24] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
##  [47] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
##  [70] -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
##  [93] -1 -1 -1 -1 -1 -1 -1 -1
ltt(ttrees,col=make.transparent("blue",0.25))

plot of chunk unnamed-chunk-1

## 100 objects of class "ltt" in a list

What's neat is that for our single fixed value of γ = -1.0, the LTT curves vary quite a lot!

[Note. In the original post I called ltt(trees,...) rather than ltt(ttrees,...) - that is, ltt on the original, rather than transformed trees. Shown above this is fixed, but the LTT plots still vary quite a bit.]

That's it. Hopefully of some help.