## Saturday, October 3, 2015

### New, reasonably fast method to compute the patristic distance between a pair of species in a large tree

A R-sig-phylo question yesterday read as follows (very slightly paraphrased):

“Given a large newick tree e.g. from the new open tree of life (304959 tips) I want to calculate a pairwise distance between a small subset of species. The tricks I've seen so far [Ed. `ape::cophenetic`, which computes a N × N patristic distance for a tree with N tips] do not work on a tree of that size. I want to determine single species to single species distances without having an algorithm attempt to calculate the distance for all possible species (which is what ape is doing). Is there a package/trick to accomplish this?”

As a matter of fact, this can be done in a reasonably efficient way using the phytools function `fastHeight`, which computes the height above the root of the MRCA of any pair of species. That is because, for two species i and j, the patristic distane between them is simply the sum of the heights above the root for species i and j minus two times the height above the root of the common ancestor of i & j.

To show that this works fairly well, let's try to write a little custom function, `fastDist`, to compute this distance using the heights as mentioned:

``````library(phytools)
fastDist<-function(tree,sp1,sp2){
fastHeight(tree,sp1,sp1)+fastHeight(tree,sp2,sp2)-
2*fastHeight(tree,sp1,sp2)
}
``````

Now, let's test it on a somewhat large tree:

``````## simulate tree at random
tree<-rtree(n=5000)
sp1<-sample(tree\$tip.label)[1]
sp1
``````
``````## [1] "t941"
``````
``````sp2<-sample(tree\$tip.label)[2]
sp2
``````
``````## [1] "t2799"
``````
``````## compute using cophenetic
system.time(d<-cophenetic(tree)[sp1,sp2])
``````
``````##    user  system elapsed
##    5.20    0.68    5.96
``````
``````d
``````
``````## [1] 11.34199
``````
``````## compute using fastDist
system.time(d<-fastDist(tree,sp1,sp2))
``````
``````##    user  system elapsed
##    0.05    0.00    0.04
``````
``````d
``````
``````## [1] 11.34199
``````

Having done that, let's go for broke with a tree of 304,959, just as described:

``````tree<-rtree(n=304959)
tree
``````
``````##
## Phylogenetic tree with 304959 tips and 304958 internal nodes.
##
## Tip labels:
##  t51258, t83258, t60186, t263493, t117091, t210153, ...
##
## Rooted; includes branch lengths.
``````
``````sp1<-sample(tree\$tip.label)[1]
sp1
``````
``````## [1] "t51585"
``````
``````sp2<-sample(tree\$tip.label)[2]
sp2
``````
``````## [1] "t244730"
``````
``````## just to check that our MRCA is not the root
fastMRCA(tree,sp1,sp2)
``````
``````## [1] 427604
``````
``````## compute using fastDist
system.time(d<-fastDist(tree,sp1,sp2))
``````
``````##    user  system elapsed
##    4.80    1.34    6.21
``````
``````d
``````
``````## [1] 12.93951
``````

Cool.

Just for the heck of it, I have also added this function to phytools & it is available from GitHub already.