Friday, June 23, 2017

Fix to read.newick for cases in which some but not all nodes are labeled

It was recently reported that there can be a mismatch between the number of nodes & the length of the node label vector in an object class "phylo" produced by the function read.newick in phytools. This appears to be a bug that is produced when some nodes have labels, but the node with the largest-valued index does not have a label. I just pushed a fix to GitHub & it appears to work. Here's a demo using the case submitted on R-sig-phylo:

library(phytools)
## Loading required package: ape
## Loading required package: maps
packageVersion("phytools")
## [1] '0.6.17'
tree<-read.newick(text='((((((a))A),(((b),(b1)))B)))C;')
tree$edge.length<-rep(1,nrow(tree$edge))
plotTree.singletons(tree)
nodelabels(tree$node.label)

plot of chunk unnamed-chunk-1

That's it.

Wednesday, June 21, 2017

Generating a set of random resolutions of all the nodes in a tree with multifurcations

Today, an R-sig-phylo list member asked:

“I am using the ape package to randomly resolve polytomies using 'multi2di' and wondering if there is a way to use this function to get a single output tree file that contains multiple different randomly resolved trees using some number of resamplings?”

This can be done using the phytools function resolveNode, which returns all possible resolutions of a given node. We just iterate over all the nodes of the tree pick a random resolution each time.

Here's a function to do that:

resolveRandom<-function(tree){
    while(!is.binary(tree)){
        nodes<-1:tree$Nnode+Ntip(tree)
        Nchildren<-function(node,tree) length(Children(tree,node))
        nchilds<-sapply(nodes,Nchildren,tree=tree)
        node<-nodes[which(nchilds>2)[1]]
        tree<-sample(resolveNode(tree,node),1)[[1]]
    }
    tree
}

Here's a quick demo:

library(phytools)
library(phangorn)
tree<-read.tree(text="(((A1,A2),(B1,B2,B3),C,D),E,F);")
plotTree(tree,type="cladogram",nodes="centered")

plot of chunk unnamed-chunk-2

## now some random resolutions
plotTree(resolveRandom(tree),type="cladogram",nodes="centered")

plot of chunk unnamed-chunk-2

plotTree(resolveRandom(tree),type="cladogram",nodes="centered")

plot of chunk unnamed-chunk-2

plotTree(resolveRandom(tree),type="cladogram",nodes="centered")

plot of chunk unnamed-chunk-2

plotTree(resolveRandom(tree),type="cladogram",nodes="centered")

plot of chunk unnamed-chunk-2

We can also generate a set of resolutions using replicate:

trees<-replicate(36,resolveRandom(tree),simplify=FALSE)
class(trees)<-"multiPhylo"
par(mfrow=c(6,6))
plotTree(trees,type="cladogram",nodes="centered")

plot of chunk unnamed-chunk-3

That's it.

Sunday, June 11, 2017

ML estimation of ancestral states for large trees using ace

A phytools user recently contacted me to report differences between several different functions in R to estimate ancestral states at internal nodes for continuous characters using likelihood.

Since I had no idea to expect this, I ran a quick comparison of ace(method="ML"), fastAnc, and anc.ML. Ostensibly, all three methods are performing ML estimation assuming a Brownian model of evolutionary change. The only difference is the method of optimization. fastAnc, for example, using a 're-rooting method' in which the tree is re-rooted at every internal node and the contrasts algorithm is used to obtain the ML state for that node. anc.ML, also in phytools, uses numerical optimization - but initiates the search using the values from fastAnc.

It occurred to me immediately that the problem may be related not to the implementation of the model, but rather to the simple issue of numerical optimization. Here is a demo suggesting that this could indeed be the case:

library(phytools)

## tiny example:
tree<-pbtree(n=26,tip.label=LETTERS)
x<-fastBM(tree)
fit1<-ace(x,tree,type="continuous",method="ML")
fit2<-fastAnc(tree,x,vars=TRUE,CI=TRUE)
fit3<-anc.ML(tree,x)
obj<-cbind(fit1$ace,fit2$ace,fit3$ace)
colnames(obj)<-c("ace(method=\"ML\")","fastAnc","anc.ML")
pairs(obj,pch=21,bg="grey",cex=1.5)

plot of chunk unnamed-chunk-1

## small example
tree<-pbtree(n=100)
x<-fastBM(tree)
fit1<-ace(x,tree,type="continuous",method="ML")
fit2<-fastAnc(tree,x,vars=TRUE,CI=TRUE)
fit3<-anc.ML(tree,x)
obj<-cbind(fit1$ace,fit2$ace,fit3$ace)
colnames(obj)<-c("ace(method=\"ML\")","fastAnc","anc.ML")
pairs(obj,pch=21,bg="grey",cex=1.5)

plot of chunk unnamed-chunk-1

## medium example
tree<-pbtree(n=200)
x<-fastBM(tree)
fit1<-ace(x,tree,type="continuous",method="ML")
fit2<-fastAnc(tree,x,vars=TRUE,CI=TRUE)
fit3<-anc.ML(tree,x)
obj<-cbind(fit1$ace,fit2$ace,fit3$ace)
colnames(obj)<-c("ace(method=\"ML\")","fastAnc","anc.ML")
pairs(obj,pch=21,bg="grey",cex=1.5)

plot of chunk unnamed-chunk-1

## large example:
tree<-pbtree(n=1000)
x<-fastBM(tree)
fit1<-ace(x,tree,type="continuous",method="ML")
## Warning in sqrt(diag(solve(h))): NaNs produced
fit2<-fastAnc(tree,x,vars=TRUE,CI=TRUE)
fit3<-anc.ML(tree,x)
obj<-cbind(fit1$ace,fit2$ace,fit3$ace)
colnames(obj)<-c("ace(method=\"ML\")","fastAnc","anc.ML")
pairs(obj,pch=21,bg="grey",cex=1.5)

plot of chunk unnamed-chunk-1

What we should see is that the results are identical for 'tiny' and relatively modest-sized trees, but go awry for large & possibly medium-sized trees where the number of parameters (the states at all the nodes) really explodes. This might suggest that the problem is with numerical optimization on large trees, rather than an issue with how the model is implemented or how the likelihood computed.

This should serve as a reminder that any method depending on numerical optimization is only as good as its optimization routine.