Monday, September 23, 2013

Getting Robinson-Foulds distances for a set of trees

The phangorn function RF.dist - for computing the Robinson-Foulds distance between two trees - is very fast. However if we want to compare a set of trees stored in an object of class "multiPhylo" there is some opportunity to recycle internal calculations & thus improve on just calling RF.dist n × (n - 1) / 2 times for n trees. Specifically, here is some code that scales to multiple trees a little bit better than n × (n - 1) / 2 RF.dist by calling the ape function prop.part (which finds all bipartitions in the tree) only once for each tree:

multiRF<-function(trees){
 if(class(trees)!="multiPhylo")
  stop("trees should be an object of class \"multiPhylo\"")
 N<-length(trees)
 RF<-matrix(0,N,N)
 if(any(sapply(unclass(trees),is.rooted))){
  cat("Some trees are rooted. Unrooting all trees.\n")
  trees<-lapply(unclass(trees),unroot)
 }
 foo<-function(pp) lapply(pp,function(x,pp)
  sort(attr(pp,"labels")[x]),pp=pp)
 xx<-lapply(unclass(trees),function(x) foo(prop.part(x)))
 for(i in 1:(N-1)) for(j in (i+1):N)
  RF[i,j]<-RF[j,i]<-2*sum(!xx[[i]]%in%xx[[j]])
 RF
}

It works OK compared to calling RF.dist a whole bunch of times independently. First, here is an alternate version of the function which uses RF.dist:

multiRF2<-function(trees){
 N<-length(trees)
 RF<-matrix(0,N,N)
 for(i in 1:(N-1)) for(j in (i+1):N)
  RF[i,j]<-RF[j,i]<-RF.dist(trees[[i]],trees[[j]])
 RF
}

Now, a very simple comparison with random trees:

> N<-100
> n<-100
> trees<-rmtree(N,n,rooted=FALSE)
> system.time(X1<-multiRF(trees))
   user  system elapsed
   9.16    0.03    9.19
> system.time(X2<-multiRF2(trees))
   user  system elapsed
  19.98    0.00   19.99
> all(X1==X2)
[1] TRUE

7 comments:

  1. On a slightly faster desktop computer than the one used for this post, 1,000 100-taxon trees took about 6 min. to run. All pairwise using RF.dist is still running.

    ReplyDelete
  2. Hi Liam,
    I will add this into phangorn's RF.dist function, if you don mind. I need just to add a check if the arguments are of class "phylo" or "multiPhylo", so it could be done with one function.
    A possible problem is that one may runs out of memory. The results of prop.part can be reasonable big for trees with many taxa.
    Regards,
    Klaus

    ReplyDelete
    Replies
    1. Sure, Klaus. I have added it to a non-CRAN version of phytools (here), but I agree that it makes sense to add the handling of multiple trees to RF.dist. I'm sure that you can improve on the general principal of recycling the bipartitions rather than recalculating for each pair.

      Can you suggest some way to check (solely on the basis of the number of trees & Nnode in each tree) if there will be enough memory to store the result from prop.part for all trees? You could perhaps use object.size on prop.part from one tree & then check against memory.limit? I've never done that in any function of phytools.

      -- Liam

      Delete
  3. Hi All,

    Thanks for pursuing this. Fantastically useful for what I'm working on (and a lot of other people too I bet).

    For what it's worth, a great PhD student just put me onto the distory package, which has a 'dist.multiPhylo' function that does something related.

    Cheers,

    Rob

    ReplyDelete
    Replies
    1. Hi Rob.

      Looks like dist.multiPhylo computes geodesic distance; which is a phylogenetic tree distance that takes branch lengths into account. Unfortunately, this method is very slow (much slower than multiRF).

      - LIam

      Delete
  4. I am finding the process to create the multiphylo object a little less than intuitive. Any hints?

    ReplyDelete
    Replies
    1. I figured out how to create the multiphylo object, read in the trees individually and then combined them.... Ran the RF.dist and it seems to require the same number of tips in all trees. Is there a way around this? 73 tips in the concatenated tree, with most trees for individual loci being near to 70.

      Delete