## Monday, May 30, 2016

### Identifying the unique tree topologies in your dataset consisting of a `"multiPhylo"` object (and then averaging some quantity over those trees)

A R-sig-phylo subscriber today posted the question:

“I have a two-column matrix, containing trees in one column and a numeric value associated with each tree in the other column. I would like to go through the list to identify any duplicated topologies, and for each set of duplicates sum the corresponding numeric values and prune away the duplicate trees.

For example,

1.0 A
1.5 A
0.2 B
0.8 C
1.0 A
3.2 B

Would become

3.5 A
3.4 B
0.8 C

Any help would be great!”

Both Emmanuel & I misunderstood the question as we both thought he had already identified the unique trees in his set; however this is not the case making the problem a little bit more complicated.

However, here is a solution. The following data consist of a set of trees some of which match in topology, but that have different edge lengths. I'm going to imagine that quantity we want to average by topology is total tree depth:

``````library(phytools)
## our function to identify unique topologies
find.unique<-function(trees,...){
if(hasArg(use.edge.length)) use.edge.length<-list(...)\$use.edge.length
else use.edge.length<-TRUE
UNIQUE<-unique(trees,use.edge.length=use.edge.length)
NAMES<-vector(mode="numeric",length=length(trees))
for(i in 1:length(UNIQUE))
NAMES[which(sapply(trees,all.equal.phylo,UNIQUE[[i]],...))]<-i
as.factor(NAMES)
}
uu<-find.unique(trees,use.edge.length=FALSE)
uu
``````
``````##    1  2  3  2  4  5  6  7  8  1  6  8  1  5  2  2  4  8  6  2  5  8  7
##   8  6  2  4  6  3  9  8  6  6  1  9  8  3  4  9  2  1  2  10 8  10 3
##   9  8  6  10 3  1  1  7  7  8  7  1  2  5  2  4  4  3  8  5  10 4  7
##   4  3  3  8  9  9  8  8  3  5  6  1  3  7  8  6  10 6  7  10 1  9  5
##   8  6  4  1  9  3  5  2
## Levels: 1 2 3 4 5 6 7 8 9 10
``````
``````hh<-sapply(trees,function(x) max(nodeHeights(x)))
hh
``````
``````##    2.575770 2.491355 3.446717 2.463738 2.896513 3.381385 2.706796
##    2.327145 3.018294 2.774250 2.576422 3.629413 3.075006 3.113694
##   3.453049 2.838071 2.475296 4.003471 3.335423 2.647890 3.047151
##   3.586491 2.869413 2.833950 2.977307 2.710317 2.451870 2.657723
##   3.519410 2.591699 4.122369 2.735734 3.036613 3.217664 2.752407
##   2.561036 3.079548 1.861568 3.325974 2.561085 2.968872 3.138570
##   3.083009 3.988694 2.726606 3.222233 2.168503 2.846598 2.327718
##   2.752483 4.329426 2.803972 4.156142 3.912881 3.102875 3.332229
##   3.161519 3.074908 2.817042 2.587192 4.159316 2.098692 2.736230
##   3.556588 2.733052 2.801303 2.953829 3.230110 3.263049 3.351896
##   3.529590 4.574951 3.724574 2.658809 2.783343 2.743181 3.159748
##   3.440510 1.847069 3.300054 2.647170 3.207739 3.654765 2.478389
##   2.629565 2.698472 2.925122 3.108748 2.762958 2.668276 2.230470
##   2.486553 3.294030 2.602165 2.358616 3.578885 2.905291 4.739568
##   2.800023 3.203737
``````
``````obj<-data.frame(height=hh,tree=uu)
``````
``````##      height tree
## 1  2.575770    1
## 2  2.491355    2
## 3  3.446717    3
## 4  2.463738    2
## 5  2.896513    4
## 6  3.381385    5
## 7  2.706796    6
## 8  2.327145    7
## 9  3.018294    8
## 10 2.774250    1
## 11 2.576422    6
## 12 3.629413    8
## 13 3.075006    1
## 14 3.113694    5
## 15 3.453049    2
## 16 2.838071    2
## 17 2.475296    4
## 18 4.003471    8
## 19 3.335423    6
## 20 2.647890    2
``````
``````## now average by topology:
H<-aggregate(obj\$height,by=list(obj\$tree),mean)
H
``````
``````##    Group.1        x
## 1        1 3.049174
## 2        2 2.953106
## 3        3 3.695116
## 4        4 2.606755
## 5        5 2.758046
## 6        6 2.817554
## 7        7 3.175049
## 8        8 3.253470
## 9        9 2.677062
## 10      10 2.829559
``````
``````## we can even cross-check to make sure....
mean(sapply(trees[uu=="1"],function(x) max(nodeHeights(x))))
``````
``````##  3.049174
``````
``````mean(sapply(trees[uu=="10"],function(x) max(nodeHeights(x)))) ## for instance
``````
``````##  2.829559
``````

The data I used for this were simulated as follows:

``````trees<-rmtree(n=10,N=10)
trees<-sample(trees,replace=TRUE,size=100)
for(i in 1:100) trees[[i]]\$edge.length<-runif(n=nrow(trees[[i]]\$edge))
``````