Monday, June 17, 2013

Getting a list of all sets of tips separated by zero length (or a distance less than some arbitrary value)

An R-sig-phylo user asks:

Is there a way to detect and list all tips of a tree with 0 length in their terminal branches (essentially duplicated sequences)? Would also be great if it's possible to output them in groups so that it's clear which tips are identical to which.

I answered the first part as follows:

I would advise setting some "tolerance" value, below which you consider a terminal edge to be zero. Then try:

tol<-1e-8
x<-na.omit(tree$tip.label[tree$edge[tree$edge.length<= tol,2]])
attr(x,"class")<-NULL
attr(x,"na.action")<-NULL

This works great for the first part - the only problem is that it ignores the second part by not telling us what tips are separated by tol or less distance from which other tips.

Here is a solution that addresses that:

D<-cophenetic(tree)
tol<-1e-12 # say
x<-apply(D,1,function(x) names(which(x<=tol)))
x<-unique(x)
x<-x[sapply(x,length)>1]

Let's take the following tree:

and try it out:
> require(ape)
Loading required package: ape
> D<-cophenetic(tree)
> tol<-1e-12 # say
> x<-apply(D,1,function(x) names(which(x<=tol)))
> x<-unique(x)
> x<-x[sapply(x,length)>1]
> x
[[1]]
[1] "t9" "t11"

[[2]]
[1] "t10" "t12" "t13"

[[3]]
[1] "t3" "t14"

[[4]]
[1] "t6" "t17"

[[5]]
[1] "t1" "t18" "t19"

Cool - it works.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.