## Tuesday, May 21, 2013

### Extracting all subtrees of a given size & larger

An R-sig-phylo query asked the following:

I'd like to sample independent subclades of greater than N tips from a phylogenetic tree much greater than N, say (number of tips >= 10*N). I can extract all subclades of greater than N taxa, but sampling independent subclades seems to be a harder problem.

Well - I'm not completely sure if this what he's asking for, but here is a function that will get all the subtrees in a binary tree that cannot be further subdivided into two subtrees of size >= clade.size.

nn<-1:tree\$Nnode+length(tree\$tip.label)
ndesc<-function(tree,node){
x<-getDescendants(tree,node)
sum(x<=length(tree\$tip.label))
}
dd<-setNames(sapply(nn,ndesc,tree=tree),nn)
aa<-nn
nodes<-vector()
while(length(aa)){
bb<-sapply(aa,function(x,tree)
tree\$edge[which(tree\$edge[,1]==x),2],tree=tree)
cc<-matrix(dd[as.character(bb)],nrow(bb),ncol(bb))
cc[is.na(cc)]<-1
gg<-apply(cc,2,function(x,cs)
nodes<-c(nodes,aa[gg])
aa<-as.vector(bb[,!gg])
}
class(trees)<-"multiPhylo"
return(trees)
}

The way this works is as follows. First, we get the number of tips descended from each node in the tree:

nn<-1:tree\$Nnode+length(tree\$tip.label)
ndesc<-function(tree,node){
x<-getDescendants(tree,node)
sum(x<=length(tree\$tip.label))
}
dd<-setNames(sapply(nn,ndesc,tree=tree),nn)
Next, we conduct a traversal of the tree from the root. If a node has two daughter clades both of size >= clade.size, then we ascend up the tree to those nodes. If not, then we store the value of the current node. At the end, we extract all subtrees whose node numbers we have stored: