Monday, May 30, 2016

Identifying the unique tree topologies in your dataset consisting of a "multiPhylo" object (and then averaging some quantity over those trees)

A R-sig-phylo subscriber today posted the question:

“I have a two-column matrix, containing trees in one column and a numeric value associated with each tree in the other column. I would like to go through the list to identify any duplicated topologies, and for each set of duplicates sum the corresponding numeric values and prune away the duplicate trees.

For example,

1.0 A
1.5 A
0.2 B
0.8 C
1.0 A
3.2 B

Would become

3.5 A
3.4 B
0.8 C

Any help would be great!”

Both Emmanuel & I misunderstood the question as we both thought he had already identified the unique trees in his set; however this is not the case making the problem a little bit more complicated.

However, here is a solution. The following data consist of a set of trees some of which match in topology, but that have different edge lengths. I'm going to imagine that quantity we want to average by topology is total tree depth:

library(phytools)
## our function to identify unique topologies
find.unique<-function(trees,...){
    if(hasArg(use.edge.length)) use.edge.length<-list(...)$use.edge.length
    else use.edge.length<-TRUE
    UNIQUE<-unique(trees,use.edge.length=use.edge.length)
    NAMES<-vector(mode="numeric",length=length(trees))
    for(i in 1:length(UNIQUE)) 
        NAMES[which(sapply(trees,all.equal.phylo,UNIQUE[[i]],...))]<-i
    as.factor(NAMES)
}
uu<-find.unique(trees,use.edge.length=FALSE)
uu
##   [1] 1  2  3  2  4  5  6  7  8  1  6  8  1  5  2  2  4  8  6  2  5  8  7 
##  [24] 8  6  2  4  6  3  9  8  6  6  1  9  8  3  4  9  2  1  2  10 8  10 3 
##  [47] 9  8  6  10 3  1  1  7  7  8  7  1  2  5  2  4  4  3  8  5  10 4  7 
##  [70] 4  3  3  8  9  9  8  8  3  5  6  1  3  7  8  6  10 6  7  10 1  9  5 
##  [93] 8  6  4  1  9  3  5  2 
## Levels: 1 2 3 4 5 6 7 8 9 10
hh<-sapply(trees,function(x) max(nodeHeights(x)))
hh
##   [1] 2.575770 2.491355 3.446717 2.463738 2.896513 3.381385 2.706796
##   [8] 2.327145 3.018294 2.774250 2.576422 3.629413 3.075006 3.113694
##  [15] 3.453049 2.838071 2.475296 4.003471 3.335423 2.647890 3.047151
##  [22] 3.586491 2.869413 2.833950 2.977307 2.710317 2.451870 2.657723
##  [29] 3.519410 2.591699 4.122369 2.735734 3.036613 3.217664 2.752407
##  [36] 2.561036 3.079548 1.861568 3.325974 2.561085 2.968872 3.138570
##  [43] 3.083009 3.988694 2.726606 3.222233 2.168503 2.846598 2.327718
##  [50] 2.752483 4.329426 2.803972 4.156142 3.912881 3.102875 3.332229
##  [57] 3.161519 3.074908 2.817042 2.587192 4.159316 2.098692 2.736230
##  [64] 3.556588 2.733052 2.801303 2.953829 3.230110 3.263049 3.351896
##  [71] 3.529590 4.574951 3.724574 2.658809 2.783343 2.743181 3.159748
##  [78] 3.440510 1.847069 3.300054 2.647170 3.207739 3.654765 2.478389
##  [85] 2.629565 2.698472 2.925122 3.108748 2.762958 2.668276 2.230470
##  [92] 2.486553 3.294030 2.602165 2.358616 3.578885 2.905291 4.739568
##  [99] 2.800023 3.203737
obj<-data.frame(height=hh,tree=uu)
head(obj,n=20)
##      height tree
## 1  2.575770    1
## 2  2.491355    2
## 3  3.446717    3
## 4  2.463738    2
## 5  2.896513    4
## 6  3.381385    5
## 7  2.706796    6
## 8  2.327145    7
## 9  3.018294    8
## 10 2.774250    1
## 11 2.576422    6
## 12 3.629413    8
## 13 3.075006    1
## 14 3.113694    5
## 15 3.453049    2
## 16 2.838071    2
## 17 2.475296    4
## 18 4.003471    8
## 19 3.335423    6
## 20 2.647890    2
## now average by topology:
H<-aggregate(obj$height,by=list(obj$tree),mean)
H
##    Group.1        x
## 1        1 3.049174
## 2        2 2.953106
## 3        3 3.695116
## 4        4 2.606755
## 5        5 2.758046
## 6        6 2.817554
## 7        7 3.175049
## 8        8 3.253470
## 9        9 2.677062
## 10      10 2.829559
## we can even cross-check to make sure....
mean(sapply(trees[uu=="1"],function(x) max(nodeHeights(x))))
## [1] 3.049174
mean(sapply(trees[uu=="10"],function(x) max(nodeHeights(x)))) ## for instance
## [1] 2.829559

The data I used for this were simulated as follows:

trees<-rmtree(n=10,N=10)
trees<-sample(trees,replace=TRUE,size=100)
for(i in 1:100) trees[[i]]$edge.length<-runif(n=nrow(trees[[i]]$edge))

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.