## Saturday, March 2, 2013

### New version of matchNodes; new minor phytools version

I just made a couple of small updates to matchNodes (1, 2). I wrote this function primarily to be called internally by fastAnc, for which it works just fine, but I've since been frustrated when trying to use it for tasks it wasn't originally designed for.

More specifically, the function is designed to match nodes between trees that are identical (to some measure of numerical precision) in species, topology, and possibly branch lengths (depending on method). When I tried to use it to match nodes across trees that were identical in core structure, but had different tips added, the function broke down.

The new version should (hopefully) fix this problem. Now, if trees 1 & 2 both contain taxa A, B, ..., N, but tree 1 also contains the extra taxa Q, R, S, while tree 2 contains the extra taxa T, U, V, the function (using method="distances") should be able to overcome this difference and match corresponding nodes across trees.

Here's a quick demo of what I mean:
> tree<-pbtree(n=10)
> # a and b (construction not shown) are copies of tree, each with its own extra tips added
> layout(c(1,2))
> plotTree(a,node.numbers=TRUE)
> plotTree(b,node.numbers=TRUE)
> matchNodes(a,b,"distances")
tr1 tr2
[1,]  16  16
[2,]  17  NA
[3,]  18  17
[4,]  19  NA
[5,]  20  18
[6,]  21  NA
[7,]  22  19
[8,]  23  20
[9,]  24  21
[10,]  25  NA
[11,]  26  22
[12,]  27  23
[13,]  28  NA
[14,]  29  25
> matchNodes(b,a,"distances")
tr1 tr2
[1,]  16  16
[2,]  17  18
[3,]  18  20
[4,]  19  22
[5,]  20  23
[6,]  21  24
[7,]  22  26
[8,]  23  27
[9,]  24  NA
[10,]  25  29
[11,]  26  NA
[12,]  27  NA
[13,]  28  NA
[14,]  29  NA

Inspection of these matrices, and the original trees, should show that matchNodes(a,b,"distances") gives the nodes of b (in column 2) that match each node in a; whereas matchNodes(b,a,"distances") gives the reverse.
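The commands that created a and b are not shown in the demo. As a rough sketch of one way to get a comparable pair of trees (not necessarily how the trees above were built), you could prune different tips from a single larger tree using drop.tip from ape, so that the two trees share a core set of taxa but each keeps its own extras. The names full, extra_a, and extra_b are made up for illustration:

```r
library(ape)
library(phytools)

set.seed(1)                       # only so the sketch is repeatable
full <- pbtree(n = 20)            # one larger tree to prune from

# hypothetical choice of "extra" tips for each tree
extra_a <- full$tip.label[1:5]    # kept in a, dropped from b
extra_b <- full$tip.label[6:10]   # kept in b, dropped from a

a <- drop.tip(full, extra_b)      # 15 tips: 10 shared + extra_a
b <- drop.tip(full, extra_a)      # 15 tips: 10 shared + extra_b

M <- matchNodes(a, b, "distances")
M                                 # NA in column 2 marks nodes of a with no match in b
```

Because the trees are simulated, the node numbers will not be the same as in the output shown above.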

One little nuance of this method is that we should probably allow it to tolerate inexact matches. This is because adding new edges to the tree, particularly if we then write the tree to file and read it back in, will introduce small errors in the distances between species and nodes, simply because branch lengths get rounded (by the numerical precision of your computer or by the number of digits in your file output format). matchNodes has an optional argument for that: tol. Let's try rounding the branch lengths of each tree, examine the consequences, and then see if it can be fixed by increasing tol:
> a$edge.length<-round(a$edge.length,4)
> b$edge.length<-round(b$edge.length,4)
> matchNodes(a,b,"distances")
tr1 tr2
[1,]  16  NA
[2,]  17  NA
[3,]  18  NA
[4,]  19  NA
[5,]  20  NA
[6,]  21  NA
[7,]  22  NA
[8,]  23  NA
[9,]  24  NA
[10,]  25  NA
[11,]  26  NA
[12,]  27  NA
[13,]  28  NA
[14,]  29  NA
> # uh-oh!!
> matchNodes(a,b,"distances",tol=0.001)
tr1 tr2
[1,]  16  16
[2,]  17  NA
[3,]  18  17
[4,]  19  NA
[5,]  20  18
[6,]  21  NA
[7,]  22  19
[8,]  23  20
[9,]  24  21
[10,]  25  NA
[11,]  26  22
[12,]  27  23
[13,]  28  NA
[14,]  29  25

Well, that's pretty cool.
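To see why the tolerance matters, here is a small base R sketch of the idea. It is not the code inside matchNodes, just an illustration; the distances and the helper close_enough are hypothetical:

```r
# hypothetical distances from one node to three tips, before and after rounding
d1 <- c(0.123456789, 1.987654321, 2.5)
d2 <- round(d1, 4)

# exact comparison fails once the values have been rounded
identical(d1, d2)                  # FALSE

# compare element by element, allowing a difference of up to tol
close_enough <- function(x, y, tol) all(abs(x - y) <= tol)

close_enough(d1, d2, tol = 1e-8)   # FALSE: rounding error is larger than tol
close_enough(d1, d2, tol = 1e-3)   # TRUE: rounding error is within tol
```

Rounding to 4 digits shifts each value by at most 0.00005, so any tol comfortably above that (such as the 0.001 used above) absorbs the difference.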

The updated function is here, but it is also in a new minor release of phytools (phytools 0.2-23), along with the new function countSimmap.

1. This trick can be used to determine if all the possible nodes of tree a have been matched to nodes in tree b (here M is the matrix returned by matchNodes(a,b,"distances")):
sum(!is.na(M[,2]))==drop.tip(a,setdiff(a$tip.label,b$tip.label))$Nnode
(It should be TRUE.)

- Liam

2. So, I take it that this matches nodes based on their tip-to-node distance?

An alternative I've used is to use prop.part to get the descendant tips of each node, drop any taxa not shared in common and match nodes that way. Of course, this leads to its own issues, with many-to-one matching of nodes when node (A,B) matches both nodes (A,B) and (A,B,D) in the second tree, when taxon D is not held in common.

1. Yes, based on the set of distances from each node to all tips.

Another method in matchNodes is method="descendants" and that uses the set of descendant tips of each node to match nodes across trees. This requires an exact match at present, although it could easily be modified to relax this.
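The many-to-one problem described above is easy to reproduce with plain sets of tip labels. The following is only a toy sketch in base R (not the code used by matchNodes or prop.part); the node names n1, m1, m2 and the helper prune are hypothetical:

```r
shared <- c("A", "B", "C")                 # taxa found in both trees

desc1 <- list(n1 = c("A", "B"))            # descendants of one node in tree 1
desc2 <- list(m1 = c("A", "B"),            # descendants of two nodes in tree 2;
              m2 = c("A", "B", "D"))       # D is not shared with tree 1

# keep only shared taxa, in a fixed order, so that sets can be compared
prune <- function(x) sort(intersect(x, shared))

# which nodes of tree 2 have the same shared descendants as n1?
hits <- names(desc2)[sapply(desc2, function(d) identical(prune(d), prune(desc1$n1)))]
hits   # both m1 and m2 match n1 once D is dropped
```

Matching on distances to the shared tips avoids this ambiguity, because the two candidate nodes sit at different distances from A and B.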

3. Hello to all,
I have a question: is it possible to compute the distance from each node to all the tips of the tree, in order to get a distance matrix of tips (rows) by nodes (columns)?

1. You can use the ape function dist.nodes, e.g.:

n<-length(tree$tip.label)
D<-dist.nodes(tree)[1:n,1:tree$Nnode+n]
rownames(D)<-tree$tip.label

- Liam

2. Thanks Liam. Very useful!
