Friday, April 24, 2015

Finding the youngest node(s) with N or more descendant tips

Today I was asked the following question (slightly paraphrased):

“How do I indentify the node defining the youngest subtree with N (for arbitrary N, in his case it was 5) descendant taxa?”

This is pretty easy. For demonstrative purposes, let's first simulate a tree:

library(phytools)
## simulate tree
tree<-pbtree(n=26,tip.label=LETTERS)

Now let's count the number of tips descended from each internal node of the tree:

## number of descendant taxa by node
N<-setNames(sapply(1:tree$Nnode+Ntip(tree),
    function(x,tree) sum(getDescendants(tree,x)<=Ntip(tree)),tree=tree),
    1:tree$Nnode+Ntip(tree))
## we can plot them to make sure we got it right
plotTree(tree,mar=c(2.1,0.1,0.1,0.1))
nodelabels(N)

plot of chunk unnamed-chunk-2

Next, we can compute the depth of each node. We can do this in more than one way. Here, I have an sapply call around nodeheight, but for a larger tree I would use one call to nodeHeights.

## node depths
d<-setNames(max(nodeHeights(tree))-
    sapply(1:tree$Nnode+Ntip(tree),nodeheight,tree=tree),
    1:tree$Nnode+Ntip(tree))
## again, let's check them by plotting
plotTree(tree,mar=c(2.1,0.1,0.1,0.1))
nodelabels(round(d,2))
axis(1,at=round(0:5/5*max(nodeHeights(tree)),2))

plot of chunk unnamed-chunk-3

Finally, let's find the most recent node (or nodes) with n=5 or more descendant tips:

## which node is the youngest node to have n=5 or more 
## descendant taxa
n<-5
node<-as.numeric(names(which(d[N>=n]==min(d[N>=n]))))

Let's check it visually:

plotTree(tree,mar=c(2.1,0.1,0.1,0.1))
axis(1,at=round(0:5/5*max(nodeHeights(tree)),2))
ii<-which(names(d)==node)
lines(max(nodeHeights(tree))-c(d[ii],d[ii]),
    y=par()$usr[3:4],lty="dashed",col="red")
nodelabels(node=node,"X")

plot of chunk unnamed-chunk-5

That's it!

Monday, April 20, 2015

Small bug fix for collapseTree, plotSimmap, plotTree

I just posted a new version of the phytools functin collapseTree. This function is an interactive function for collapsing (& expanding) subtrees on a phylogeny.

This version fixes two bugs in earlier versions. Firstly, prior versions crashed if the user clicked on the root node (which is an attempt to collapse the tree into a single node). This is still not permitted, but the result is that nothing happens and a message is printed. Secondly, prior versions also crashed under certain conditions when the tree was collapsed into two tips. This is actually due to a bug in how the environmental variable "last_plot.phylo" was created. This bug is now fixed in both collapseTree and plotSimmap / plotTree.

Since this function creates an animation there is no point in try to illustrate it here, but the following shows a video of it in use, with these bugs fixed. It also may seem smoother because it is being plotted user a faster machine than before.

This is also in a new non-CRAN version of phytools, and I am hoping to submit an update of phytools to CRAN soon.

Tuesday, April 14, 2015

Phylogenetic regression when branch lengths are unknown: A few different scenarios

In a recent R-sig-phylo discussion a user asked if it would be reasonable to use branches sampled assuming a Yule process in phylogenetic regression under conditions in which the branch lengths of the tree are unknown. Although I supplied a function designed to sample branching times on an arbitrary topology given this process, I could not answer the question of whether it was a good idea (or not) to use Yule process edge lengths when the topology was known, but branch lengths are not, even if it is reasonable to assume that the tree arose under a pure-birth process.

Here, I attempt to briefly explore that question.

First, my function to sample edges lengths under a Yule process (from last time):

yuleEdges<-function(tree,b=1,plot=TRUE,...){
    ll<-rexp(n=Ntip(tree)-1,rate=2:Ntip(tree)*b)
    tree$edge.length<-rep(0,nrow(tree$edge))
    live.nodes<-Descendants(tree,Ntip(tree)+1,"children")
    tips<-vector()
    for(i in 1:length(ll)){
        tips<-c(tips,live.nodes[live.nodes<=Ntip(tree)])
        live.nodes<-setdiff(live.nodes,tips)
        ii<-which(tree$edge[,2]%in%c(live.nodes,tips))
        tree$edge.length[ii]<-tree$edge.length[ii]+ll[i]
        node<-if(length(live.nodes)<=1) live.nodes else 
            sample(live.nodes,1) ## choose one node
        live.nodes<-c(setdiff(live.nodes,node),
            Descendants(tree,node,"children"))
        if(plot) plotTree(tree,...)
    }
    tree
}

Now, let's simulate some data. First, let's try uncorrelated data. We can thus explore bias & variance in the estimate of the contrasts/PGLS regression slope (it should be zero), as well as type I error.

## load libraries
library(phytools)
library(phangorn)
library(ape)
## simulate 200 pure-birth trees:
trees<-ttrees<-pbtree(n=50,scale=1,nsim=200)
## ttrees contains the trees with their original branch lengths
## we'll perform manipulations on trees
foo<-function(tree){
    obj<-fastBM(tree,nsim=2)
    colnames(obj)<-c("x","y")
    as.data.frame(obj)
}
xy<-lapply(ttrees,foo)

Next, as a control for our experiment, let's fit contrasts regressions (PGLS,here) to each of these, using the known, true branch lengths. We'll just pull out the slope, β1 and the P-value for each regression:

library(nlme)
fit.model<-function(tree,data){
    data$v<-diag(vcv.phylo(tree))
    fit<-gls(y~x,data=data,correlation=corBrownian(1,tree),
        weights=varFixed(~v))
    setNames(c(coefficients(fit)[2],anova(fit)$"p-value"[2]),
        c("beta","p-value"))
}
fit.true<-t(mapply(fit.model,trees,xy))
mean(fit.true[,1]) ## should be zero
## [1] -0.005196606
## should be uniform on [0,1]
hist(fit.true[,2],freq=FALSE,xlab="P-value",
    main="P-values, known branch lengths") 

plot of chunk unnamed-chunk-3

mean(fit.true[,2]<=0.05) ## should be about 0.05
## [1] 0.035

OK, now imagine we are in a situation without branch lengths. We're going to consider a few different possibilities:

(1) All branch lengths set equal to 1.0.

(2) Grafen's (1989) branch lengths.

(3) Branch lengths randomly sampled under a Yule process.

First (1), setting all branch lengths to zero:

foo<-function(tree){
    tree$edge.length<-rep(1,nrow(tree$edge))
    tree
}
trees<-lapply(ttrees,foo)
class(trees)<-"multiPhylo"
fit.equal<-t(mapply(fit.model,trees,xy))
mean(fit.equal[,1]) ## should be zero
## [1] -0.006076158
## should be uniform on [0,1]
hist(fit.equal[,2],freq=FALSE,xlab="P-value",
    main="P-values, all branch lengths 1.0")

plot of chunk unnamed-chunk-4

mean(fit.equal[,2]<=0.05) ## should be 0.05
## [1] 0.125

Now (2), using Grafen's (1989) branch lengths:

trees<-lapply(ttrees,compute.brlen)
class(trees)<-"multiPhylo"
fit.grafen<-t(mapply(fit.model,trees,xy))
mean(fit.grafen[,1]) ## should be zero
## [1] -0.007811392
## should be uniform on [0,1]
par(mar=c(5.1,4.1,4.1,2.1))
hist(fit.grafen[,2],freq=FALSE,xlab="P-value",
    main="P-values, Grafen edge lengths") 

plot of chunk unnamed-chunk-5

mean(fit.grafen[,2]<=0.05) ## should be 0.05
## [1] 0.125

Finally, (3), branch lengths sampled under a Yule process. Now for this, rather than using one set of branch lengths, for each tree we should simulate a set of branch lengths and then average our inference over this set. This is logical, because any individual set of branch lengths will be wrong, but perhaps by computing the variance among our estimated parameter and adding it to the mean variance of each estimate (under the law of total variance) should give us the correct variance of our estimator. Let's do this:

## this function simulates 100 sets of edge lengths, fits the model
## to each of them using PGLS, and extracts the coefficient & variance
## then averages the variance across trees, adds this to the variance
## among trees, and computes a p-value for the parameter from the 
## t-distribution
yuleApply<-function(tree,data,nrep=100){
    trees<-replicate(nrep,yuleEdges(tree,plot=FALSE),simplify=FALSE)
    class(trees)<-"multiPhylo"
    gls.fit<-function(tree,data){
        obj<-gls(y~x,data=data,correlation=corBrownian(1,tree))
        setNames(c(coefficients(obj)[2],obj$varBeta[2,2]),
            c("beta","varBeta"))
    }
    fit<-t(sapply(trees,gls.fit,data=data))
    b<-mean(fit[,"beta"])
    v<-var(fit[,"beta"])+mean(fit[,"varBeta"])
    p<-2*pt(abs(b/sqrt(v)),df=Ntip(tree)-1,lower.tail=FALSE)
    setNames(c(b,v,p),c("beta","varBeta","p-value"))
}
fit.yule<-t(mapply(yuleApply,ttrees,xy))
mean(fit.yule[,1]) ## should be zero
## [1] -0.03133003
## should be uniform on [0,1]
hist(fit.yule[,3],freq=FALSE,xlab="P-value",
    main="P-values, Yule branch lengths")

plot of chunk unnamed-chunk-6

mean(fit.yule[,3]<=0.05) ## should be 0.05
## [1] 0

So, unless I messed something up here, it looks as though simulating edge lengths under a Yule process, though unbiased, is excessively conservative.

Next, let's try the situation in which x and y are genuinely correlated. Again, we start with the data generation, say with β1 = 0.4:

foo<-function(tree,beta,vare){
    x<-fastBM(tree)
    e<-fastBM(tree,sig2=vare)
    y<-beta*x+e
    data.frame(x,y)
}
xy<-lapply(ttrees,foo,beta=0.4,vare=1.2)

First, with the true edge lengths:

fit.true<-t(mapply(fit.model,trees,xy))
mean(fit.true[,1]) ## should be 0.4
## [1] 0.3856536
## this is power now, rather than type I error
mean(fit.true[,2]<=0.05)
## [1] 0.615

Now all edge lengths to 1.0:

foo<-function(tree){
    tree$edge.length<-rep(1,nrow(tree$edge))
    tree
}
trees<-lapply(ttrees,foo)
class(trees)<-"multiPhylo"
fit.equal<-t(mapply(fit.model,trees,xy))
mean(fit.equal[,1]) ## should be 0.4
## [1] 0.3946721
mean(fit.equal[,2]<=0.05) ## power
## [1] 0.605

Now Grafen's edge lengths with compute.brlen:

trees<-lapply(ttrees,compute.brlen)
class(trees)<-"multiPhylo"
fit.grafen<-t(mapply(fit.model,trees,xy))
mean(fit.grafen[,1]) ## should be 0.4
## [1] 0.3856536
mean(fit.grafen[,2]<=0.05) ## power
## [1] 0.615

Now, branching times sampled under a Yule process, as before:

fit.yule<-t(mapply(yuleApply,ttrees,xy))
mean(fit.yule[,1]) ## should be 0.4
## [1] 0.3870531
mean(fit.yule[,3]<=0.05) ## power
## [1] 0.145

So, there you have it. Although it would seem to be the case that sampling branching times under which the known true branches arose, a Yule process, would be a good strategy when the branch lengths are unknown - this has the effect of leading to a type I error rate considerably below the nominal rate, as well as to low power relative to other kinds of arbitrary branch lengths. Who knew!

Monday, April 13, 2015

Sampling edge lengths under a Yule process

There has been a little bit of discussion today on R-sig-phylo listserve about transforming branch lengths.

One thing that wasn't mentioned was the possibility of sampling branch lengths under a model. I thought it would be straightforward to sample branch lengths under a Yule model (that is, a pure-birth speciation model).

The following is code that does this. (Set plot=TRUE to see a cool animation of the tree being 'grown' from left to right.)

pb_edgelength<-function(tree,b=1,plot=TRUE,...){
    ll<-rexp(n=Ntip(tree)-1,rate=2:Ntip(tree)*b)
    tree$edge.length<-rep(0,nrow(tree$edge))
    live.nodes<-Descendants(tree,Ntip(tree)+1,"children")
    tips<-vector()
    for(i in 1:length(ll)){
        tips<-c(tips,live.nodes[live.nodes<=Ntip(tree)])
        live.nodes<-setdiff(live.nodes,tips)
        ii<-which(tree$edge[,2]%in%c(live.nodes,tips))
        tree$edge.length[ii]<-tree$edge.length[ii]+ll[i]
        node<-if(length(live.nodes)<=1) live.nodes else 
            sample(live.nodes,1) ## choose one node
        live.nodes<-c(setdiff(live.nodes,node),
            Descendants(tree,node,"children"))
        if(plot) plotTree(tree,...)
    }
    tree
}

OK, now let's try it out with a tree obtained using rtree from the ape package:

library(phytools)
library(phangorn)
tree<-rtree(n=100,br=NULL) ## no branch lengths
t.pb<-pb_edgelength(tree,plot=FALSE)
plotTree(t.pb,ftype="off")

plot of chunk unnamed-chunk-2

par(mar=c(5.1,4.1,2.1,2.1))
obj<-ltt(t.pb)

plot of chunk unnamed-chunk-2

obj$gamma
## [1] -0.1807099

Compare this to Grafen's edge lengths from compute.brlen:

t.grafen<-compute.brlen(tree)
plotTree(t.grafen,ftype="off")

plot of chunk unnamed-chunk-3

par(mar=c(5.1,4.1,2.1,2.1))
obj<-ltt(t.grafen)

plot of chunk unnamed-chunk-3

obj$gamma
## [1] 6.732706

Obviously, the branching times from Grafen's branch length transformation are very different from those obtained under a Yule process!

That's it, really.