## Tuesday, May 29, 2012

### Function to efficiently return the MRCA of a pair of species for trees with many tips

A user reports that she wants to use the phytools function findMRCA to get the MRCA of a group of species. The problem is that her phylogeny is very large (~60,000 tips) and findMRCA uses the ape utility function mrca which does not work for very large trees. (The reason that mrca will not work for trees this large is probably because mrca returns a giant matrix containing the MRCA of every pair of species in the tree. For a tree with 60K taxa, this matrix would require over 14GB of memory.) She asks if there might be a different way to accomplish this task.

The way I approached this problem was first just to see if there was a more efficient way to get the MRCA of a single pair of species. There is. Here, I use the function Ancestors from the phangorn package:

fastMRCA<-function(tree,sp1,sp2){
x<-match(sp1,tree\$tip.label)
y<-match(sp2,tree\$tip.label)
a<-Ancestors(tree,x)
b<-Ancestors(tree,y)
z<-a%in%b
return(a[min(which(z))])
}

The code x<-match(sp1,tree\$tip.label) translates the input tip name to a node number; a<-Ancestors(tree,x) returns a vector containing the node numbers of the ancestors of sp1, from the tip to the root; finally z<-a%in%b and a[min(which(z))] finds the elements of a that are in b, and then identifies which of these common elements are closest to the tips of the tree (i.e., the most recent). That's it! It seems to work, even on very large trees. The next natural step is use the same approach to find the MRCA of a set of species, just as in findMRCA.