## Tuesday, March 15, 2011

### Matrix representation parsimony (MRP) supertree estimation

Are any of my readers aware of an existing function to do Matrix Representation Parsimony (MRP) supertree estimation (Baum 1992; Ragan 1992; also see Sanderson et al. 1998; Bininda-Emonds et al. 2007) in R? I'm not, but I was thinking it would be a fun - and easy - thing to program. Consequently I have just posted a function for this, called mrp.supertree(), on my R-phylogenetics page (direct link to code here). This function could also be used to get the MRP consensus tree (that is, if all the input phylogenies have the same set of taxa; e.g., see Felsenstein 2004 pp. 527-528).

I programmed this by using two handy functions: the function prop.part() in {ape} (which returns all the non-trivial bipartitions for a given tree) and pratchet() in {phangorn} (Schliep 2010), which estimates the MP tree using the parsimony ratchet method of Nixon (1999).

The way MRP supertree estimation works is by first constructing N (for N trees) ni x mi binary matrices (for ni species and mi non-trivial bipartitions in the ith tree). We score the tree matrices by putting a 1 in the j,kth position if the jth taxon is found in the kth bipartition, for each bipartition, and a 0 otherwise. We then accumulate these matrices into a supermatrix which contains union(speciesi) rows and sum(mi) columns. We put a "?" into the supermatrix if a species is missing from the tree. Finally, our MRP supertree is the MP tree estimated for this binary supermatrix.

To see how well this function does, let's first generate a random reference tree with 50 species:

> true.tree<-rtree(n=50,br=NULL)

Now let's randomly prune 40-60% of the species from this tree ten times:

> trees<-list()
> class(trees)<-"multiPhylo"
> for(i in 1:10)
trees[[i]]<- drop.tip(true.tree,sample(true.tree\$tip.label,
size=(20+round(runif(1)*10))))

Now let's load our MRP supertree code from source and run it:

> source("mrp.supertree.R")
> super.tree<-mrp.supertree(trees)
The MRP supertree, optimized via pratchet(),
has a parsimony score of 247 (minimum 247)

Now we can compare the estimated supertree to the true tree using dist.topo() in {ape}:

> dist.topo(true.tree,super.tree)
[1] 4

In this case, it is close. If we have high overlap this will generally be the case. For instance if we only drop 10-20 species from the each tree at random:

> for(i in 1:10)
trees[[i]]<- drop.tip(true.tree,sample(true.tree\$tip.label,
size=(10+round(runif(1)*10))))
> super.tree<-mrp.supertree(trees)
The MRP supertree, optimized via pratchet(),
has a parsimony score of 315 (minimum 315)
> dist.topo(true.tree,super.tree)
[1] 0

whereas if we drop 30-40 at random:

> for(i in 1:10)
trees[[i]]<-drop.tip(true.tree,
sample(true.tree\$tip.label,
size=(30+round(runif(1)*10))))
> super.tree<-mrp.supertree(trees)
The MRP supertree, optimized via pratchet(),
has a parsimony score of 208 (minimum 141)
> dist.topo(true.tree,super.tree)
[1] 82

A few notes on this implementation. Firstly, I set pratchet() to return all equally parsimonious trees (if several are found, mrp.supertree() will return a "multiPhylo" object) - but this is not the same as a guarantee to find all most parsimonious trees. This could only be guaranteed with an exhaustive search. Secondly, pratchet() does not always find the MP tree - so you might run mrp.supertree() multiple times if the parsimony score is greater than the theoretically minimal score.

Feedback welcome.

1. Hi,

This is actually useful for me! I was looking for a way to build a MRP. I'm not very familiarized with R, but I try to load a newick file with all the trees and it gives me a error. How are you passing the trees to the function?

I have a newick file with the trees inside.

2. The function takes an object of the class "multiPhylo" (that is, a list of trees stored in memory). If you are starting with a simple text file with Newick trees (named, for instance, "treefile.tre") in a list, e.g.:

(A,(B,C));
(B,(C,D));

Then you should do the following:

require(phytools)
mrpTree<-mrp.supertree(trees)

I hope this helps!

Note that the supertree is unrooted and has no branch lengths.

- Liam

3. Thanks!
Now I'm having an error creating the matrix:

Error in XX[rownames(X[[i]]), c(j:((j - 1) + ncol(X[[i]])))] <- X[[i]][1:nrow(X[[i]]), :
subscript out of bounds

I don't really understand why it doesn't work, also it seems to put empties as labels..

Sorry for the inconvenience..
Appreciate!

4. Hi. I'm not sure what the problem is, to be honest. If you email me your datafile (liam.revell@umb.edu) or, assuming they are small trees, post your Newick tree strings here, I can try to help more. - Liam

5. I've solved the problem. read.tree reads newick files without distances. I had distances, now I have to generate trees without distances

6. Nice package which saved a lot of work for me! Thank you very much! But one question remains: Is it possible to bootstrap the supertree to get a confidence value for its splits?

7. so... cannot we work with trees with distances?!

8. The method does not use the branch lengths of the original tree, if that's what you mean.

9. I already got my mrpTree. HOw could I save the tree (in newick format, for instance) in a different file? I am trying with save(mrpTree, file="xxxxx") but the file cannot be readed...

10. You can write the tree to file using write.tree or write.nexus in the ape package, depending on how you want to store it.
- Liam

11. Hi Dima. Hard to diagnose from the error message, which is not familiar. (Looks like a segmentation fault, which is usually due to memory allocation problems, but I cannot remember seeing this in R.) Feel free to send me the datafiles and I will try and run. - Liam

12. Hi,

Do you know of a more memory efficient way of of constructing a supertree for high numbers of taxa and trees, e.g. of the order of 4000 taxa and 500 trees?

Thankyou, Richard

1. Hi Richard. I don't as my research is not on supertree estimation. (I programmed this function for a phylogenetics class I taught last year.) If you can't find what you're looking for online, maybe you should email your query to the R-SIG-phylo email list, as there are lots of experts who subscribe. Good luck! Liam

13. Hello,
I am trying to build a supertree (just getting started, I am trying to combine two trees now) but after doing this:
for(i in 1:length(trees)){
temp<-prop.part(trees[[i]]) # find all bipartitions
# create matrix representation of phy[[i]] in X[[i]]
X[[i]]<-matrix(0,nrow=length(trees[[i]][,"tip"]),ncol=length(temp)-1)
for(j in 1:ncol(X[[i]])) X[[i]][c(temp[[j+1]]),j]<-1
rownames(X[[i]])<-attr(temp,"labels") # label rows
if(i==1) species<-trees[[i]][,"tip.label"]
else species<-union(species,trees[[i]][,"tip.label"]) # accumulate labels
characters<-characters+ncol(X[[i]]) # count characters
}
I keep getting the same error:
Error in x[[1]]\$tip.label : \$ operator is invalid for atomic vectors

Even when I change \$tip.label to [,"tip.label"]