A few days ago a phytools user wrote:
“I apologize if this has been discussed elsewhere, but I am curious if you
have an easy way to extract and compare homologous edges of a tree that all
contain the same taxa but have different edge lengths and are written
differently in there newick format? I would use grep to extract these normally
but I have a case where I dated a large topology and the tree output is written
differently than the tree I input in newick format. I want to check
correlation of dates to the actual branch lengths but they are not in the same
order in the newick format.”
I'm sure there are a number of ways to do this in R; however, one way is using
the phytools function matchNodes
.
library(phytools)
trees
## 9 phylogenetic trees
## first, the trees are all topologically identical:
densityTree(trees,type="cladogram",nodes="intermediate")
## now that they have different node rotations & numbering
par(mfrow=c(3,3))
plotTree(trees,lwd=1,node.numbers=T,fsize=0.8)
The function matchNodes
matches node numbers between different
trees based on (by default) commonality in the identity of the descendants
of each node. In the following code I will match each of the trees in the set
to the first - but this is arbitrary. Note that we must do this separately
for labels (using sister function matchLabels
) and then combine
the two matrices.
M1<-matrix(NA,Ntip(trees[[1]]),length(trees),
dimnames=list(trees[[1]]$tip.label,
paste("t[[",1:length(trees),"]]",sep="")))
M2<-matrix(NA,trees[[1]]$Nnode,length(trees),
dimnames=list(1:trees[[1]]$Nnode+Ntip(trees[[1]]),
paste("t[[",1:length(trees),"]]",sep="")))
for(i in 1:length(trees)){
M1[,i]<-matchLabels(trees[[1]],trees[[i]])[,2]
M2[,i]<-matchNodes(trees[[1]],trees[[i]])[,2]
}
M<-rbind(M1,M2)
M
## t[[1]] t[[2]] t[[3]] t[[4]] t[[5]] t[[6]] t[[7]] t[[8]] t[[9]]
## H 1 25 9 21 5 22 6 5 1
## I 2 26 8 20 6 21 7 6 2
## F 3 22 6 18 7 24 9 3 4
## E 4 23 7 17 8 23 8 2 5
## G 5 24 5 19 9 25 10 4 3
## A 6 20 3 25 4 17 2 8 7
## C 7 18 1 24 3 19 4 10 9
## B 8 19 2 23 2 18 3 9 8
## D 9 21 4 22 1 20 5 7 6
## J 10 17 10 26 10 26 1 1 10
## U 11 4 12 13 12 6 16 23 23
## V 12 3 13 11 14 4 14 24 22
## W 13 2 14 12 13 5 15 25 21
## Z 14 7 17 16 15 3 11 22 26
## Y 15 6 15 15 17 2 12 20 25
## X 16 5 16 14 16 1 13 21 24
## T 17 1 11 10 11 7 17 26 20
## O 18 15 25 4 23 8 22 18 16
## P 19 12 24 1 24 9 25 15 17
## R 20 14 23 3 25 11 24 17 18
## Q 21 13 22 2 26 10 23 16 19
## K 22 8 19 7 20 12 18 13 12
## L 23 9 18 8 19 13 19 14 13
## M 24 11 20 6 21 14 21 11 15
## N 25 10 21 5 22 15 20 12 14
## S 26 16 26 9 18 16 26 19 11
## 27 27 27 27 27 27 27 27 27 27
## 28 28 43 28 43 28 43 28 28 28
## 29 29 44 29 44 29 44 29 29 29
## 30 30 48 33 45 33 48 33 30 30
## 31 31 51 36 48 34 49 34 33 31
## 32 32 49 34 46 35 50 35 31 32
## 33 33 50 35 47 36 51 36 32 33
## 34 34 45 30 49 30 45 30 34 34
## 35 35 46 31 50 31 46 31 35 35
## 36 36 47 32 51 32 47 32 36 36
## 37 37 28 37 28 37 28 37 37 37
## 38 38 29 38 37 38 29 38 46 46
## 39 39 30 39 38 39 30 39 47 47
## 40 40 31 40 39 40 33 42 50 48
## 41 41 32 41 40 41 34 43 51 49
## 42 42 33 42 41 42 31 40 48 50
## 43 43 34 43 42 43 32 41 49 51
## 44 44 35 44 29 44 35 44 38 38
## 45 45 36 45 30 45 36 45 39 39
## 46 46 40 49 31 49 37 49 43 43
## 47 47 41 50 32 50 38 50 44 44
## 48 48 42 51 33 51 39 51 45 45
## 49 49 37 46 34 46 40 46 40 40
## 50 50 38 47 36 47 41 47 42 41
## 51 51 39 48 35 48 42 48 41 42
The way to interpret this is that the i,jth element of
M
matches the ith taxon label or node number in the
row names to the nodes in the column jth tree.
Next we want to get the edge lengths of all the corresponding edges in all
the trees. This should be fairly easy now. Note that we have one few edge
length in each tree than the number of rows in M
because
(generally) the root node does not have an edge length.
M<-M[-Ntip(trees[[1]])-1,] ## trim root node from M
E<-matrix(NA,nrow(M),ncol(M),dimnames=dimnames(M))
for(i in 1:ncol(M)) for(j in 1:nrow(M))
E[j,i]<-trees[[i]]$edge.length[which(trees[[i]]$edge[,2]==M[j,i])]
print(E,digits=3)
## t[[1]] t[[2]] t[[3]] t[[4]] t[[5]] t[[6]] t[[7]] t[[8]] t[[9]]
## H 41.507 38.43 40.89 40.93 35.380 36.741 40.698 36.70 39.46
## I 41.507 38.43 40.89 40.93 35.380 36.741 40.698 36.70 39.46
## F 25.372 25.51 22.98 24.74 23.830 27.529 25.377 26.37 25.94
## E 25.372 25.51 22.98 24.74 23.830 27.529 25.377 26.37 25.94
## G 34.043 31.93 31.06 32.62 29.796 33.251 33.857 31.85 32.47
## A 28.602 33.14 29.13 29.46 28.879 25.782 32.828 26.38 26.89
## C 3.499 3.74 2.98 1.27 1.862 0.522 1.192 2.49 5.84
## B 3.499 3.74 2.98 1.27 1.862 0.522 1.192 2.49 5.84
## D 36.666 38.67 36.18 37.60 39.167 37.088 35.332 36.74 37.25
## J 97.374 94.51 92.68 90.95 93.051 97.177 91.018 94.77 91.71
## U 12.275 16.01 15.30 16.34 14.012 13.259 15.002 13.60 11.46
## V 5.248 4.90 4.79 3.60 3.848 4.543 9.768 6.24 4.03
## W 5.248 4.90 4.79 3.60 3.848 4.543 9.768 6.24 4.03
## Z 7.952 12.70 13.57 13.49 8.555 13.479 10.517 12.95 8.01
## Y 5.860 12.70 13.57 10.00 7.919 13.479 10.517 10.32 8.01
## X 5.860 12.70 13.57 10.00 7.919 13.479 10.517 10.32 8.01
## T 77.345 78.97 78.68 79.36 77.708 80.346 80.827 81.22 78.95
## O 33.773 30.63 32.62 32.31 29.050 29.524 31.983 35.08 35.77
## P 29.987 25.67 30.19 25.47 28.086 28.537 25.924 29.73 26.43
## R 18.898 15.71 16.82 15.43 16.459 13.036 12.350 15.81 15.68
## Q 18.898 15.71 16.82 15.43 16.459 13.036 12.350 15.81 15.68
## K 21.646 19.60 18.16 19.51 22.487 18.644 18.263 19.11 19.64
## L 21.646 19.60 18.16 19.51 22.487 18.644 18.263 19.11 19.64
## M 29.576 24.77 29.13 25.65 27.042 24.809 28.559 22.98 24.85
## N 29.576 24.77 29.13 25.65 27.042 24.809 28.559 22.98 24.85
## S 66.926 60.54 69.51 66.75 64.339 70.155 69.758 62.07 62.50
## 28 2.466 5.65 9.39 2.68 3.798 5.466 12.050 5.17 7.41
## 29 6.992 8.90 7.17 4.12 8.859 6.314 5.507 7.45 7.40
## 30 19.058 18.65 18.28 18.35 22.574 18.007 16.631 18.48 16.88
## 31 29.817 28.53 26.34 27.56 26.239 36.115 28.183 32.14 27.96
## 32 37.280 35.04 36.17 35.86 31.822 39.605 35.023 36.99 34.95
## 33 8.672 6.42 8.08 7.88 5.966 5.722 8.480 5.47 6.53
## 34 53.716 46.94 49.33 49.24 45.026 53.775 50.179 50.58 47.05
## 35 8.064 5.53 7.05 8.13 10.288 11.306 2.504 10.35 10.36
## 36 25.102 29.41 26.15 28.20 27.016 25.259 31.636 23.89 21.05
## 37 19.982 21.20 17.92 14.27 19.142 20.337 17.424 18.72 20.17
## 38 2.513 0.00 5.46 0.00 0.000 1.960 4.817 0.00 0.00
## 39 64.567 56.63 60.85 59.94 60.023 61.396 65.656 61.40 64.04
## 40 0.502 6.32 2.54 3.07 3.672 5.691 0.169 6.22 3.46
## 41 7.027 11.11 10.51 12.74 10.165 8.716 5.234 7.36 7.43
## 42 4.825 9.63 4.27 5.93 9.130 5.471 4.654 6.87 6.90
## 43 2.093 0.00 0.00 3.49 0.636 0.000 0.000 2.64 0.00
## 44 12.931 18.43 14.63 12.61 13.368 12.151 15.886 19.15 16.45
## 45 24.173 21.19 26.63 23.24 26.666 27.403 22.921 20.84 20.47
## 46 8.980 8.72 10.26 11.20 8.623 13.228 14.854 6.15 6.27
## 47 3.786 4.96 2.43 6.84 0.964 0.987 6.059 5.36 9.34
## 48 11.089 9.96 13.37 10.04 11.627 15.501 13.574 13.92 10.75
## 49 6.915 8.75 11.83 12.77 5.908 8.559 13.627 10.35 10.51
## 50 14.191 11.00 12.89 11.23 9.278 15.549 14.946 11.77 11.88
## 51 6.261 5.83 1.92 5.09 4.723 9.384 4.651 7.91 6.67
Neat.
Of course, if we want we could get the average edge lengths for corresponding
edges across the set of trees as follows:
edge.length<-rowMeans(E)
ii<-sapply(trees[[1]]$edge[,2],function(x,y) which(y==x),y=M[,1])
tree<-trees[[1]]
tree$edge.length<-edge.length[ii]
plotTree(tree)
We can see that when we average the edge lengths from a set of topologically
identical ultrametric trees, the resulting tree is also ultrametric.
And, to see that this is indeed the average, let's overlay it on our plot
made with densityTree
, this time using a square phylogram
plotting style:
densityTree(trees,nodes="centered",compute.consensus=FALSE)
par(fg="transparent")
plotTree(tree,nodes="centered",mar=par()$mar,
direction="leftwards",xlim=get("last_plot.phylo",
envir=.PlotPhyloEnv)$x.lim,add=TRUE,
color=make.transparent("darkgrey",0.5),lwd=6,ftype="i")
par(fg="black")
The 'average' tree is show in grey, whereas our sample of trees are given
in blue.
Note that it is also possible to do this automatically (that is, obtaining
the average values of corresponding edges from a set of trees) using
the phytools function consensus.edges
.
I hope this addresses the user's question.
For this exercise I simulated my set of trees as follows:
tree<-pbtree(n=26,tip.label=LETTERS,scale=100)
trees<-list()
for(i in 1:9){
nodes<-sample(1:tree$Nnode+Ntip(tree),10)
trees[[i]]<-tree
trees[[i]]$edge.length<-abs(tree$edge.length+
rnorm(n=nrow(tree$edge),sd=3))
trees[[i]]<-force.ultrametric(trees[[i]])
for(j in 1:length(nodes))
trees[[i]]<-untangle(rotate(trees[[i]],nodes[j]),"read.tree")
}
class(trees)<-"multiPhylo"
although I assume that in the empirical case they would come from my
inference method.
That's it.