A R-sig-phylo question yesterday read as follows (very slightly paraphrased):

*“Given a large newick tree e.g. from the new open tree of life (304959 tips)
I want to calculate a pairwise distance between a small subset of species. The
tricks I've seen so far [Ed. ape::cophenetic, which computes a N
× N patristic distance for a tree with N tips] do not work on
a tree of that size. I want to determine single species to single
species distances without having an algorithm attempt to calculate the
distance for all possible species (which is what ape is doing). Is there a
package/trick to accomplish this?”*

As a matter of fact,
this can be done in a reasonably efficient way using the phytools function
`fastHeight`

, which computes the height above the root of the MRCA of
any pair of species. That is because, for two species *i* and *j*,
the patristic distane between them is simply the sum of the heights above the
root for species *i* and *j* minus two times the height above the root
of the common ancestor of *i* & *j*.

To show that this works fairly well, let's try to write a little custom function,
`fastDist`

, to compute this distance using the heights as mentioned:

```
library(phytools)
fastDist<-function(tree,sp1,sp2){
fastHeight(tree,sp1,sp1)+fastHeight(tree,sp2,sp2)-
2*fastHeight(tree,sp1,sp2)
}
```

Now, let's test it on a somewhat large tree:

```
## simulate tree at random
tree<-rtree(n=5000)
sp1<-sample(tree$tip.label)[1]
sp1
```

```
## [1] "t941"
```

```
sp2<-sample(tree$tip.label)[2]
sp2
```

```
## [1] "t2799"
```

```
## compute using cophenetic
system.time(d<-cophenetic(tree)[sp1,sp2])
```

```
## user system elapsed
## 5.20 0.68 5.96
```

```
d
```

```
## [1] 11.34199
```

```
## compute using fastDist
system.time(d<-fastDist(tree,sp1,sp2))
```

```
## user system elapsed
## 0.05 0.00 0.04
```

```
d
```

```
## [1] 11.34199
```

Having done that, let's go for broke with a tree of 304,959, just as described:

```
tree<-rtree(n=304959)
tree
```

```
##
## Phylogenetic tree with 304959 tips and 304958 internal nodes.
##
## Tip labels:
## t51258, t83258, t60186, t263493, t117091, t210153, ...
##
## Rooted; includes branch lengths.
```

```
sp1<-sample(tree$tip.label)[1]
sp1
```

```
## [1] "t51585"
```

```
sp2<-sample(tree$tip.label)[2]
sp2
```

```
## [1] "t244730"
```

```
## just to check that our MRCA is not the root
fastMRCA(tree,sp1,sp2)
```

```
## [1] 427604
```

```
## compute using fastDist
system.time(d<-fastDist(tree,sp1,sp2))
```

```
## user system elapsed
## 4.80 1.34 6.21
```

```
d
```

```
## [1] 12.93951
```

Cool.

Just for the heck of it, I have also added this function to phytools & it is available from GitHub already.

## No comments:

## Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.