## Saturday, February 9, 2013

### More on MRP supertree estimation in phytools

In a post from earlier today, I described some of the updates in a new version of the phytools function mrp.supertree. The two main changes are as follows: (1) now the user can decide which parsimony optimization method (optim.parsimony or pratchet) is called inside the function; and (2) now near total control of the optimizer has been transferred to the user.

What this means is that (with one exception*) all the options of optim.parsimony and pratchet (summarized here) can now be controlled by the user. The exception is really only a modification to how the optimizer is controlled. For mrp.supertree(...,method="optim.parsimony") the argument start is substituted for the argument tree in optim.parsimony; and, for both mrp.supertree(...,method="pratchet") and mrp.supertree(...,method="optim.parsimony"), the argument start can be an object of class "phylo", i.e., a true starting tree, or, it can be a method for obtaining the starting tree - where the options are start="NJ" or start="random". For start="NJ" we first compute the neighbor-joining tree from the distances in the MRP trees matrix. I can't justify this theoretically, but it does seem to put us in the neighborhood of the MRP MP tree more effectively than random chance.

I thought I'd just show a quick demo of how the new version of mrp.supertree can be used - as well as how bad it can be under some circumstances. One good thing about the modifications I've made to the function is that now it will more easily inherit any additional improvements Klaus makes to the optimizers in the phangorn package.

> # load phytools & source
> require(phytools)
> source("mrp.supertree.R")
> # simulate a tree
> tree<-rtree(n=100,rooted=FALSE)
> # function randomly subsamples a tree to n species
> foo<-function(x,n){
+ tips<-sample(x\$tip.label)[1:n]
+ x<-drop.tip(x,setdiff(x\$tip.label,tips))
+ }
> # generate a set of subsampled trees
> trees<-replicate(n=5,foo(tree,40),simplify=FALSE)
> class(trees)<-"multiPhylo"
> # now I have 5 trees subsampled differently from the same
> # parent tree
> # attempt supertree estimation
> st.nni<-mrp.supertree(trees) # default method
[1] "Best pscore so far: 289"
[1] "Best pscore so far: 289"v [1] ...
pratchet() found 5 supertrees
with a parsimony score of 285 (minimum 185)
> # SPR tree rearrangment
> st.spr<-mrp.supertree(trees,rearrangements="SPR")
[1] 288
[1] ...
[1] "Best pscore so far: 202"
[1] ...
The MRP supertree, optimized via pratchet(),
has a parsimony score of 202 (minimum 185)
> # SPR with a better starting tree
> st.nj<-mrp.supertree(trees,rearrangements="SPR", start="NJ")
[1] "Best pscore so far: 332"
[1] ...
The MRP supertree, optimized via pratchet(),
has a parsimony score of 199 (minimum 185)
> # optim.parsimony with SPR
> st.op<-mrp.supertree(trees,rearrangements="SPR", method="optim.parsimony",start="NJ")
[1] 288
[1] ...
Final p-score 202 after 37 nni operations
The MRP supertree, optimized via optim.parsimony(),
has a parsimony score of 202 (minimum 185)
> # ok, now as a check let's try with the true tree
> # as our starting tree
> # (we have to prune some tips that were not in our sampled trees)
> tips<-unique(as.vector(sapply(trees,function(x) x\$tip.label)))
> st.tt<-mrp.supertree(trees,method="optim.parsimony", start=drop.tip(tree,setdiff(tree\$tip.label,tips)))
Final p-score 185 after 0 nni operations
The MRP supertree, optimized via optim.parsimony(),
has a parsimony score of 185 (minimum 185)
> # obviously we didn't do so hot before, but let's verify
> # first compute the tree containing only subsampled tips
> sub<-drop.tip(tree,setdiff(tree\$tip.label,tips))
> sapply(st.nni,dist.topo,sub)
[1] 140 140 140 140 140
> dist.topo(st.spr,sub)
[1] 82
> dist.topo(st.nj,drop.tip(tree,sub)
[1] 80
> dist.topo(st.op,sub)
[1] 82
> # as a check
> dist.topo(st.tt,sub)
[1] 0

This suggests to me that we have enough information in our input trees to get our true tree back (at least for the tips we sampled) - but due to limits on optimization, as far as it has been implemented so far - we're just not getting there.

For comparison, we could try a similar example in which we subsample a much larger fraction of the taxa in each tree - say 60% instead of 40%:

> # generate a set of subsampled trees
> trees<-replicate(n=5,foo(tree,60),simplify=FALSE)
> class(trees)<-"multiPhylo"
> # now I have 5 trees subsampled differently from the same
> # parent tree
> st.nni<-mrp.supertree(trees)
...
The MRP supertree, optimized via pratchet(),
has a parsimony score of 376 (minimum 285)
> st.spr<-mrp.supertree(trees,rearrangements="SPR")
...
The MRP supertree, optimized via pratchet(),
has a parsimony score of 285 (minimum 285)
> st.nj<-mrp.supertree(trees,rearrangements="SPR", start="NJ")
...
The MRP supertree, optimized via pratchet(),
has a parsimony score of 285 (minimum 285)
> st.op<-mrp.supertree(trees,rearrangements="SPR", method="optim.parsimony",start="NJ")
...
Final p-score 285 after 49 nni operations
The MRP supertree, optimized via optim.parsimony(),
has a parsimony score of 285 (minimum 285)
> st.tt<-mrp.supertree(trees,method="optim.parsimony", start=tree)
Final p-score 285 after 0 nni operations
The MRP supertree, optimized via optim.parsimony(),
has a parsimony score of 285 (minimum 285)
> dist.topo(st.nni,tree)
[1] 92
> dist.topo(st.spr,tree)
[1] 14
> dist.topo(st.nj,tree)
[1] 14
> dist.topo(st.op,tree)
[1] 14
> dist.topo(st.tt,tree)
[1] 0

No huge surprise that we do much better when we have more data. Something that is notable is that even though we get a "perfect score," so to speak, in all runs except for the default - there is still some topological discordance between the supertrees and the true trees. Most likely this is because some splits are not found in any of the trees used for supertree estimation. Effectively, heuristic parsimony searching will find just one of the possible resolutions of the true uncertainty about this node. The consequence - topological discordance between the estimated and the true trees.