Sunday, April 3, 2011

Exhaustive and Branch & Bound MP searching

To respond to a post by Matthew Vavrek (and for fun, besides), I just created a function, exhaustiveMP(), that performs exhaustive as well as branch & bound MP tree searches. To do the parsimony calculations and the tree search it calls functions from {ape} and (especially) {phangorn}. I have posted this to my R-phylogenetics page (direct link to code, here).

The main challenge in doing this was adding a tip in all possible places on the tree (the subject of my last post); once this was solved, the rest was easily done using the {phangorn} functions parsimony() and allTrees(); and the {ape} function bind.tree().

This function could easily be modified to conduct ML exhaustive and branch & bound tree searches, using, for instance, pml() from the {phangorn} package. I use parsimony here only because Matthew's inquiry was about MP; and because computing the parsimony score is much faster.

Of course, nothing about this function is fast. It will in fact be very slow for more than 7 or 8 species. Note that method="exhaustive" allows no more than 10 species; and method="branch.and.bound" no more than 15. I have not tried the function with that many species.

I believe this function works properly, but I welcome any experiences or feedback.

2 comments:

  1. Hi Liam,

    nice code. Here are two small additions. The first is a heuristic of the input order of the labels. We want to add always the taxa which is furthest apart from the taxa in the tree, to push the parsimony score over the bound as quick as possible for "bad" trees.
    The second function I added computes parsimonious uninformative sites. There are more similar tricks which should be incorporated.

    In the example below summaryRprof() showed that there is lots of potential of improving, especially bind.tree(). I takes about 4 sec. (+1 sec to reorder the tree), from 14 sec. in total. Computing the parsimony score takes only ~2.5 sec.
    Here is the code:

    parsinfo = function(x){
    nr <- attr(x, "nr")
    labels = attr(x, "allLevels")
    result = matrix(NA, nrow = length(x), ncol = nr)
    for (i in 1:length(x)) result[i, ] <- x[[i]]
    contrast <- attr(x, "contrast")
    D = tcrossprod(contrast)
    l = dim(contrast)[1]
    tab = apply(result, 2, tabulate, l)
    tab2 = tab
    tab2[] = FALSE
    tab2[tab>1] = TRUE
    tab3 = tab
    tab3[] = FALSE
    tab3[tab>0] = TRUE ## ==1
    tab3 = t(tab3)
    ps = rowSums(tab3) - 1
    diag(D) = 0
    tab4 = tab3 %*% D
    tab4 = tab3 * tab4 #
    ind = which(colSums(tab2)==1 & rowSums(tab4)==0)
    ps = rowSums(tab3) - 1
    cbind(ind, ps[ind])
    }

    getorder <- function(x){
    label = names(x)
    dm = as.matrix(dist.hamming(x, FALSE))
    ind = which(dm==max(dm), arr.ind=TRUE)[1,]
    added = label[ind]
    remaining<-label[-ind]
    for(i in 3:length(x)){
    tmp = which.max(rowSums(dm[-ind, ind, drop=FALSE]))
    added = c(added, remaining[tmp])
    ind = c(ind, match(remaining[tmp], label))
    remaining<-remaining[-tmp]
    }
    added
    }


    branch.and.bound.2 <- function(data, tree=NULL, trace=1, ...){
    # first, compute the parsimony score on the input tree
    tree = pratchet(data, start=tree, trace=trace-1, ...)
    # use only parsimony informative sites
    nr <- attr(data, "nr")
    pis <- parsinfo(data)
    p0 <- sum(attr(data, 'weight')[pis[,1]] * pis[,2])
    data.orig <- data
    data <- phangorn:::getRows(data, c(1:nr)[-pis[,1]], TRUE)
    #compute input order
    label = getorder(data)
    data <- phangorn:::prepareDataFitch(data)
    #compute upper bound
    bound<- phangorn:::fit.fitch(tree,data)
    # now pick three species to start
    new<-list(stree(n=3,tip.label=label[1:3]))
    class(new)<-"multiPhylo"
    added<-new[[1]]$tip.label; remaining<-setdiff(label,added)
    # branch & bound
    while(length(remaining)>0){
    old<-new; new<-list()
    new.tip<-remaining[1]
    pscores<-vector()
    for(i in 1:length(old)){
    temp<-add.everywhere(old[[i]],new.tip)
    temp <- unclass(temp)
    score = sapply(temp, phangorn:::fit.fitch, data)
    # score<-fitch(temp,data)
    new<-unlist(list(new,temp[score<=bound]),recursive=FALSE); class(new)<-"multiPhylo"
    pscores<-c(pscores,score[score<=bound])
    }
    added<-c(added,new.tip)
    if(trace) print(paste(length(added),"species added;",length(new),"trees retained",collapse=""))
    remaining<-setdiff(tree$tip.label,added)
    }
    # ok, done, now sort what needs to be returned
    trees<-new[pscores==min(pscores)]
    for(i in 1:length(trees)) attr(trees[[i]],"pscore")<-min(pscores)+p0
    return(trees) # return all mp trees
    }


    library(phangorn)
    data(yeast)
    tree = pratchet(yeast)
    system.time(tree1 <- branch.and.bound(yeast, tree))
    system.time(tree2 <- branch.and.bound.2(yeast, tree))

    ReplyDelete
  2. Great post Klaus. I will try it out when I get a chance.

    ReplyDelete