To respond to a post by Matthew Vavrek (and for fun, besides), I just created a function, exhaustiveMP(), that performs exhaustive as well as branch & bound MP tree searches. To do the parsimony calculations and the tree search it calls functions from {ape} and (especially) {phangorn}. I have posted this to my R-phylogenetics page (direct link to code, here).

The main challenge in doing this was adding a tip in all possible places on the tree (the subject of my last post); once this was solved, the rest was easily done using the {phangorn} functions parsimony() and allTrees(); and the {ape} function bind.tree().

This function could easily be modified to conduct ML exhaustive and branch & bound tree searches, using, for instance, pml() from the {phangorn} package. I use parsimony here only because Matthew's inquiry was about MP; and because computing the parsimony score is much faster.

Of course, nothing about this function is fast. It will in fact be very slow for more than 7 or 8 species. Note that method="exhaustive" allows no more than 10 species; and method="branch.and.bound" no more than 15. I have not tried the function with that many species.

I believe this function works properly, but I welcome any experiences or feedback.

Hi Liam,

ReplyDeletenice code. Here are two small additions. The first is a heuristic of the input order of the labels. We want to add always the taxa which is furthest apart from the taxa in the tree, to push the parsimony score over the bound as quick as possible for "bad" trees.

The second function I added computes parsimonious uninformative sites. There are more similar tricks which should be incorporated.

In the example below summaryRprof() showed that there is lots of potential of improving, especially bind.tree(). I takes about 4 sec. (+1 sec to reorder the tree), from 14 sec. in total. Computing the parsimony score takes only ~2.5 sec.

Here is the code:

parsinfo = function(x){

nr <- attr(x, "nr")

labels = attr(x, "allLevels")

result = matrix(NA, nrow = length(x), ncol = nr)

for (i in 1:length(x)) result[i, ] <- x[[i]]

contrast <- attr(x, "contrast")

D = tcrossprod(contrast)

l = dim(contrast)[1]

tab = apply(result, 2, tabulate, l)

tab2 = tab

tab2[] = FALSE

tab2[tab>1] = TRUE

tab3 = tab

tab3[] = FALSE

tab3[tab>0] = TRUE ## ==1

tab3 = t(tab3)

ps = rowSums(tab3) - 1

diag(D) = 0

tab4 = tab3 %*% D

tab4 = tab3 * tab4 #

ind = which(colSums(tab2)==1 & rowSums(tab4)==0)

ps = rowSums(tab3) - 1

cbind(ind, ps[ind])

}

getorder <- function(x){

label = names(x)

dm = as.matrix(dist.hamming(x, FALSE))

ind = which(dm==max(dm), arr.ind=TRUE)[1,]

added = label[ind]

remaining<-label[-ind]

for(i in 3:length(x)){

tmp = which.max(rowSums(dm[-ind, ind, drop=FALSE]))

added = c(added, remaining[tmp])

ind = c(ind, match(remaining[tmp], label))

remaining<-remaining[-tmp]

}

added

}

branch.and.bound.2 <- function(data, tree=NULL, trace=1, ...){

# first, compute the parsimony score on the input tree

tree = pratchet(data, start=tree, trace=trace-1, ...)

# use only parsimony informative sites

nr <- attr(data, "nr")

pis <- parsinfo(data)

p0 <- sum(attr(data, 'weight')[pis[,1]] * pis[,2])

data.orig <- data

data <- phangorn:::getRows(data, c(1:nr)[-pis[,1]], TRUE)

#compute input order

label = getorder(data)

data <- phangorn:::prepareDataFitch(data)

#compute upper bound

bound<- phangorn:::fit.fitch(tree,data)

# now pick three species to start

new<-list(stree(n=3,tip.label=label[1:3]))

class(new)<-"multiPhylo"

added<-new[[1]]$tip.label; remaining<-setdiff(label,added)

# branch & bound

while(length(remaining)>0){

old<-new; new<-list()

new.tip<-remaining[1]

pscores<-vector()

for(i in 1:length(old)){

temp<-add.everywhere(old[[i]],new.tip)

temp <- unclass(temp)

score = sapply(temp, phangorn:::fit.fitch, data)

# score<-fitch(temp,data)

new<-unlist(list(new,temp[score<=bound]),recursive=FALSE); class(new)<-"multiPhylo"

pscores<-c(pscores,score[score<=bound])

}

added<-c(added,new.tip)

if(trace) print(paste(length(added),"species added;",length(new),"trees retained",collapse=""))

remaining<-setdiff(tree$tip.label,added)

}

# ok, done, now sort what needs to be returned

trees<-new[pscores==min(pscores)]

for(i in 1:length(trees)) attr(trees[[i]],"pscore")<-min(pscores)+p0

return(trees) # return all mp trees

}

library(phangorn)

data(yeast)

tree = pratchet(yeast)

system.time(tree1 <- branch.and.bound(yeast, tree))

system.time(tree2 <- branch.and.bound.2(yeast, tree))

Great post Klaus. I will try it out when I get a chance.

ReplyDeleteNice post. Keep updating

ReplyDeleteArtificial Intelligence Online Training