Phylogenetic Tools for Comparative Biology: Getting Robinson-Foulds distances for a set of trees

Monday, September 23, 2013

Getting Robinson-Foulds distances for a set of trees

The phangorn function RF.dist - for computing the Robinson-Foulds distance between two trees - is very fast. However if we want to compare a set of trees stored in an object of class "multiPhylo" there is some opportunity to recycle internal calculations & thus improve on just calling RF.dist n × (n - 1) / 2 times for n trees. Specifically, here is some code that scales to multiple trees a little bit better than n × (n - 1) / 2 RF.dist by calling the ape function prop.part (which finds all bipartitions in the tree) only once for each tree:

multiRF<-function(trees){
if(class(trees)!="multiPhylo")
  stop("trees should be an object of class \"multiPhylo\"")
N<-length(trees)
RF<-matrix(0,N,N)
if(any(sapply(unclass(trees),is.rooted))){
  cat("Some trees are rooted. Unrooting all trees.\n")
  trees<-lapply(unclass(trees),unroot)
}
foo<-function(pp) lapply(pp,function(x,pp)
  sort(attr(pp,"labels")[x]),pp=pp)
xx<-lapply(unclass(trees),function(x) foo(prop.part(x)))
for(i in 1:(N-1)) for(j in (i+1):N)
  RF[i,j]<-RF[j,i]<-2*sum(!xx[[i]]%in%xx[[j]])
RF
}

It works OK compared to calling RF.dist a whole bunch of times independently. First, here is an alternate version of the function which uses RF.dist:

multiRF2<-function(trees){
N<-length(trees)
RF<-matrix(0,N,N)
for(i in 1:(N-1)) for(j in (i+1):N)
RF[i,j]<-RF[j,i]<-RF.dist(trees[[i]],trees[[j]])
RF
}

Now, a very simple comparison with random trees:

> N<-100
> n<-100
> trees<-rmtree(N,n,rooted=FALSE)
> system.time(X1<-multiRF(trees))
user system elapsed
9.16 0.03 9.19
> system.time(X2<-multiRF2(trees))
user system elapsed
19.98 0.00 19.99
> all(X1==X2)
[1] TRUE

7 comments:

Liam RevellSeptember 23, 2013 at 5:07 PM
On a slightly faster desktop computer than the one used for this post, 1,000 100-taxon trees took about 6 min. to run. All pairwise using RF.dist is still running.
ReplyDelete
Replies
KlausSeptember 24, 2013 at 3:05 PM
Hi Liam,
I will add this into phangorn's RF.dist function, if you don mind. I need just to add a check if the arguments are of class "phylo" or "multiPhylo", so it could be done with one function.
A possible problem is that one may runs out of memory. The results of prop.part can be reasonable big for trees with many taxa.
Regards,
Klaus

ReplyDelete
Replies
Rob LanfearSeptember 25, 2013 at 2:50 AM
Hi All,

Thanks for pursuing this. Fantastically useful for what I'm working on (and a lot of other people too I bet).

For what it's worth, a great PhD student just put me onto the distory package, which has a 'dist.multiPhylo' function that does something related.

Cheers,

Rob
ReplyDelete
Replies
cbuddenhagenMay 29, 2014 at 10:56 PM
I am finding the process to create the multiphylo object a little less than intuitive. Any hints?
ReplyDelete
Replies

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.

Pages

Monday, September 23, 2013

Getting Robinson-Foulds distances for a set of trees

7 comments: