Thursday, December 7, 2017

Matching edges between topologically identical trees with different node rotations & edge lengths (and computing their average values)

A few days ago a phytools user wrote:

“I apologize if this has been discussed elsewhere, but I am curious if you have an easy way to extract and compare homologous edges of a tree that all contain the same taxa but have different edge lengths and are written differently in there newick format? I would use grep to extract these normally but I have a case where I dated a large topology and the tree output is written differently than the tree I input in newick format. I want to check correlation of dates to the actual branch lengths but they are not in the same order in the newick format.”

I'm sure there are a number of ways to do this in R; however, one way is using the phytools function matchNodes.

library(phytools)
trees
## 9 phylogenetic trees
## first, the trees are all topologically identical:
densityTree(trees,type="cladogram",nodes="intermediate")

plot of chunk unnamed-chunk-1

## now that they have different node rotations & numbering
par(mfrow=c(3,3))
plotTree(trees,lwd=1,node.numbers=T,fsize=0.8)

plot of chunk unnamed-chunk-1

The function matchNodes matches node numbers between different trees based on (by default) commonality in the identity of the descendants of each node. In the following code I will match each of the trees in the set to the first - but this is arbitrary. Note that we must do this separately for labels (using sister function matchLabels) and then combine the two matrices.

M1<-matrix(NA,Ntip(trees[[1]]),length(trees),
    dimnames=list(trees[[1]]$tip.label,
    paste("t[[",1:length(trees),"]]",sep="")))
M2<-matrix(NA,trees[[1]]$Nnode,length(trees),
    dimnames=list(1:trees[[1]]$Nnode+Ntip(trees[[1]]),
    paste("t[[",1:length(trees),"]]",sep="")))
for(i in 1:length(trees)){
    M1[,i]<-matchLabels(trees[[1]],trees[[i]])[,2]
    M2[,i]<-matchNodes(trees[[1]],trees[[i]])[,2]
}
M<-rbind(M1,M2)
M
##    t[[1]] t[[2]] t[[3]] t[[4]] t[[5]] t[[6]] t[[7]] t[[8]] t[[9]]
## H       1     25      9     21      5     22      6      5      1
## I       2     26      8     20      6     21      7      6      2
## F       3     22      6     18      7     24      9      3      4
## E       4     23      7     17      8     23      8      2      5
## G       5     24      5     19      9     25     10      4      3
## A       6     20      3     25      4     17      2      8      7
## C       7     18      1     24      3     19      4     10      9
## B       8     19      2     23      2     18      3      9      8
## D       9     21      4     22      1     20      5      7      6
## J      10     17     10     26     10     26      1      1     10
## U      11      4     12     13     12      6     16     23     23
## V      12      3     13     11     14      4     14     24     22
## W      13      2     14     12     13      5     15     25     21
## Z      14      7     17     16     15      3     11     22     26
## Y      15      6     15     15     17      2     12     20     25
## X      16      5     16     14     16      1     13     21     24
## T      17      1     11     10     11      7     17     26     20
## O      18     15     25      4     23      8     22     18     16
## P      19     12     24      1     24      9     25     15     17
## R      20     14     23      3     25     11     24     17     18
## Q      21     13     22      2     26     10     23     16     19
## K      22      8     19      7     20     12     18     13     12
## L      23      9     18      8     19     13     19     14     13
## M      24     11     20      6     21     14     21     11     15
## N      25     10     21      5     22     15     20     12     14
## S      26     16     26      9     18     16     26     19     11
## 27     27     27     27     27     27     27     27     27     27
## 28     28     43     28     43     28     43     28     28     28
## 29     29     44     29     44     29     44     29     29     29
## 30     30     48     33     45     33     48     33     30     30
## 31     31     51     36     48     34     49     34     33     31
## 32     32     49     34     46     35     50     35     31     32
## 33     33     50     35     47     36     51     36     32     33
## 34     34     45     30     49     30     45     30     34     34
## 35     35     46     31     50     31     46     31     35     35
## 36     36     47     32     51     32     47     32     36     36
## 37     37     28     37     28     37     28     37     37     37
## 38     38     29     38     37     38     29     38     46     46
## 39     39     30     39     38     39     30     39     47     47
## 40     40     31     40     39     40     33     42     50     48
## 41     41     32     41     40     41     34     43     51     49
## 42     42     33     42     41     42     31     40     48     50
## 43     43     34     43     42     43     32     41     49     51
## 44     44     35     44     29     44     35     44     38     38
## 45     45     36     45     30     45     36     45     39     39
## 46     46     40     49     31     49     37     49     43     43
## 47     47     41     50     32     50     38     50     44     44
## 48     48     42     51     33     51     39     51     45     45
## 49     49     37     46     34     46     40     46     40     40
## 50     50     38     47     36     47     41     47     42     41
## 51     51     39     48     35     48     42     48     41     42

The way to interpret this is that the i,jth element of M matches the ith taxon label or node number in the row names to the nodes in the column jth tree.

Next we want to get the edge lengths of all the corresponding edges in all the trees. This should be fairly easy now. Note that we have one few edge length in each tree than the number of rows in M because (generally) the root node does not have an edge length.

M<-M[-Ntip(trees[[1]])-1,] ## trim root node from M
E<-matrix(NA,nrow(M),ncol(M),dimnames=dimnames(M))
for(i in 1:ncol(M)) for(j in 1:nrow(M))
    E[j,i]<-trees[[i]]$edge.length[which(trees[[i]]$edge[,2]==M[j,i])]
print(E,digits=3)
##    t[[1]] t[[2]] t[[3]] t[[4]] t[[5]] t[[6]] t[[7]] t[[8]] t[[9]]
## H  41.507  38.43  40.89  40.93 35.380 36.741 40.698  36.70  39.46
## I  41.507  38.43  40.89  40.93 35.380 36.741 40.698  36.70  39.46
## F  25.372  25.51  22.98  24.74 23.830 27.529 25.377  26.37  25.94
## E  25.372  25.51  22.98  24.74 23.830 27.529 25.377  26.37  25.94
## G  34.043  31.93  31.06  32.62 29.796 33.251 33.857  31.85  32.47
## A  28.602  33.14  29.13  29.46 28.879 25.782 32.828  26.38  26.89
## C   3.499   3.74   2.98   1.27  1.862  0.522  1.192   2.49   5.84
## B   3.499   3.74   2.98   1.27  1.862  0.522  1.192   2.49   5.84
## D  36.666  38.67  36.18  37.60 39.167 37.088 35.332  36.74  37.25
## J  97.374  94.51  92.68  90.95 93.051 97.177 91.018  94.77  91.71
## U  12.275  16.01  15.30  16.34 14.012 13.259 15.002  13.60  11.46
## V   5.248   4.90   4.79   3.60  3.848  4.543  9.768   6.24   4.03
## W   5.248   4.90   4.79   3.60  3.848  4.543  9.768   6.24   4.03
## Z   7.952  12.70  13.57  13.49  8.555 13.479 10.517  12.95   8.01
## Y   5.860  12.70  13.57  10.00  7.919 13.479 10.517  10.32   8.01
## X   5.860  12.70  13.57  10.00  7.919 13.479 10.517  10.32   8.01
## T  77.345  78.97  78.68  79.36 77.708 80.346 80.827  81.22  78.95
## O  33.773  30.63  32.62  32.31 29.050 29.524 31.983  35.08  35.77
## P  29.987  25.67  30.19  25.47 28.086 28.537 25.924  29.73  26.43
## R  18.898  15.71  16.82  15.43 16.459 13.036 12.350  15.81  15.68
## Q  18.898  15.71  16.82  15.43 16.459 13.036 12.350  15.81  15.68
## K  21.646  19.60  18.16  19.51 22.487 18.644 18.263  19.11  19.64
## L  21.646  19.60  18.16  19.51 22.487 18.644 18.263  19.11  19.64
## M  29.576  24.77  29.13  25.65 27.042 24.809 28.559  22.98  24.85
## N  29.576  24.77  29.13  25.65 27.042 24.809 28.559  22.98  24.85
## S  66.926  60.54  69.51  66.75 64.339 70.155 69.758  62.07  62.50
## 28  2.466   5.65   9.39   2.68  3.798  5.466 12.050   5.17   7.41
## 29  6.992   8.90   7.17   4.12  8.859  6.314  5.507   7.45   7.40
## 30 19.058  18.65  18.28  18.35 22.574 18.007 16.631  18.48  16.88
## 31 29.817  28.53  26.34  27.56 26.239 36.115 28.183  32.14  27.96
## 32 37.280  35.04  36.17  35.86 31.822 39.605 35.023  36.99  34.95
## 33  8.672   6.42   8.08   7.88  5.966  5.722  8.480   5.47   6.53
## 34 53.716  46.94  49.33  49.24 45.026 53.775 50.179  50.58  47.05
## 35  8.064   5.53   7.05   8.13 10.288 11.306  2.504  10.35  10.36
## 36 25.102  29.41  26.15  28.20 27.016 25.259 31.636  23.89  21.05
## 37 19.982  21.20  17.92  14.27 19.142 20.337 17.424  18.72  20.17
## 38  2.513   0.00   5.46   0.00  0.000  1.960  4.817   0.00   0.00
## 39 64.567  56.63  60.85  59.94 60.023 61.396 65.656  61.40  64.04
## 40  0.502   6.32   2.54   3.07  3.672  5.691  0.169   6.22   3.46
## 41  7.027  11.11  10.51  12.74 10.165  8.716  5.234   7.36   7.43
## 42  4.825   9.63   4.27   5.93  9.130  5.471  4.654   6.87   6.90
## 43  2.093   0.00   0.00   3.49  0.636  0.000  0.000   2.64   0.00
## 44 12.931  18.43  14.63  12.61 13.368 12.151 15.886  19.15  16.45
## 45 24.173  21.19  26.63  23.24 26.666 27.403 22.921  20.84  20.47
## 46  8.980   8.72  10.26  11.20  8.623 13.228 14.854   6.15   6.27
## 47  3.786   4.96   2.43   6.84  0.964  0.987  6.059   5.36   9.34
## 48 11.089   9.96  13.37  10.04 11.627 15.501 13.574  13.92  10.75
## 49  6.915   8.75  11.83  12.77  5.908  8.559 13.627  10.35  10.51
## 50 14.191  11.00  12.89  11.23  9.278 15.549 14.946  11.77  11.88
## 51  6.261   5.83   1.92   5.09  4.723  9.384  4.651   7.91   6.67

Neat.

Of course, if we want we could get the average edge lengths for corresponding edges across the set of trees as follows:

edge.length<-rowMeans(E)
ii<-sapply(trees[[1]]$edge[,2],function(x,y) which(y==x),y=M[,1])
tree<-trees[[1]]
tree$edge.length<-edge.length[ii]
plotTree(tree)

plot of chunk unnamed-chunk-4

We can see that when we average the edge lengths from a set of topologically identical ultrametric trees, the resulting tree is also ultrametric.

And, to see that this is indeed the average, let's overlay it on our plot made with densityTree, this time using a square phylogram plotting style:

densityTree(trees,nodes="centered",compute.consensus=FALSE)
par(fg="transparent")
plotTree(tree,nodes="centered",mar=par()$mar,
    direction="leftwards",xlim=get("last_plot.phylo",
    envir=.PlotPhyloEnv)$x.lim,add=TRUE,
    color=make.transparent("darkgrey",0.5),lwd=6,ftype="i")

plot of chunk unnamed-chunk-5

par(fg="black")

The 'average' tree is show in grey, whereas our sample of trees are given in blue.

Note that it is also possible to do this automatically (that is, obtaining the average values of corresponding edges from a set of trees) using the phytools function consensus.edges.

I hope this addresses the user's question.

For this exercise I simulated my set of trees as follows:

tree<-pbtree(n=26,tip.label=LETTERS,scale=100)
trees<-list()
for(i in 1:9){
    nodes<-sample(1:tree$Nnode+Ntip(tree),10)
    trees[[i]]<-tree
    trees[[i]]$edge.length<-abs(tree$edge.length+
        rnorm(n=nrow(tree$edge),sd=3))
    trees[[i]]<-force.ultrametric(trees[[i]])
    for(j in 1:length(nodes))
        trees[[i]]<-untangle(rotate(trees[[i]],nodes[j]),"read.tree")
}
class(trees)<-"multiPhylo"

although I assume that in the empirical case they would come from my inference method.

That's it.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.