Tuesday, March 31, 2015

Converting a phylogeny with node labels to a taxonomy

An R-sig-phylo query asked:

“I wondered if anyone had coded a method for converting a phylogenetic tree with polytomies and node labels into a taxonomy (in some form of data table).”

So far as I know, this has not been done, but it is not very hard either, at least in the idealized case in which every taxonomic level (i.e., Order, Family, Genus, etc.) that we want in our table is marked with a node label, and every path from the root to any tip passes through the same number of labeled nodes.
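The second condition can be checked before attempting any conversion. What follows is only a minimal sketch, assuming that unlabeled nodes carry an empty string in tree$node.label; countLabels is a hypothetical helper name (not part of ape or phytools), and the five-tip tree is made up for illustration.

```r
library(ape)

## hypothetical helper (not part of ape or phytools): number of
## non-empty node labels on the path from the root to each tip
countLabels<-function(tree){
    n<-Ntip(tree)
    paths<-nodepath(tree) ## one root-to-tip path per tip
    counts<-sapply(paths,function(p){
        internal<-p[p>n] ## drop the tip itself
        sum(tree$node.label[internal-n]!="")
    })
    setNames(counts,tree$tip.label)
}

## small made-up example: an order containing two genera,
## one of which is a polytomy
tree<-read.tree(text="((t1,t2)Genus_1,(t3,t4,t5)Genus_2)Order_1;")
counts<-countLabels(tree)
counts
## the rule holds if every tip has the same count
length(unique(counts))==1
```

If the final expression is FALSE, the paths carry different numbers of labels and the simple approach will not return a tidy matrix.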

Here's a quick demo, using a balanced tree - although it could equally well be a polytomous tree, or a stochastic tree, so long as the aforementioned rule holds.

library(phytools)
## first here is our tree
tree<-stree(n=64,type="balanced")
tree$node.label<-rep("",tree$Nnode)
tree$node.label[1]<-"Order_1"
tree$node.label[66-Ntip(tree)]<-"Family_1"
tree$node.label[97-Ntip(tree)]<-"Family_2"
tree$node.label[68-Ntip(tree)]<-"Genus_1"
tree$node.label[75-Ntip(tree)]<-"Genus_2"
tree$node.label[83-Ntip(tree)]<-"Genus_3"
tree$node.label[90-Ntip(tree)]<-"Genus_4"
tree$node.label[99-Ntip(tree)]<-"Genus_5"
tree$node.label[106-Ntip(tree)]<-"Genus_6"
tree$node.label[114-Ntip(tree)]<-"Genus_7"
tree$node.label[121-Ntip(tree)]<-"Genus_8"
plotTree(tree,fsize=0.65,ftype="i",lwd=1,xlim=c(-0.06,1.1))
nodelabels(node=which(tree$node.label!="")+Ntip(tree),
    text=tree$node.label[which(tree$node.label!="")],cex=0.7)


Now here is the code to pull out the taxonomy:

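## getAncestors is accessed via ::: as an internal phytools function
getAncestors<-phytools:::getAncestors
## labels on the path from the root to a single tip, root first
foo<-function(tip,tree){
    node<-which(tree$tip.label==tip)
    ## ancestors come back tip-to-root; convert to node.label indices
    a<-tree$node.label[getAncestors(tree,node,"all")-Ntip(tree)]
    a<-rev(a) ## reorder from root to tip
    c(a[a!=""],tip) ## drop unlabeled nodes, append the tip
}
## one row per tip; named taxonomy rather than T, which would
## mask R's shorthand for TRUE
taxonomy<-t(sapply(tree$tip.label,foo,tree=tree))
taxonomy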
##     [,1]      [,2]       [,3]      [,4] 
## t1  "Order_1" "Family_1" "Genus_1" "t1" 
## t2  "Order_1" "Family_1" "Genus_1" "t2" 
## t3  "Order_1" "Family_1" "Genus_1" "t3" 
## t4  "Order_1" "Family_1" "Genus_1" "t4" 
## t5  "Order_1" "Family_1" "Genus_1" "t5" 
## t6  "Order_1" "Family_1" "Genus_1" "t6" 
## t7  "Order_1" "Family_1" "Genus_1" "t7" 
## t8  "Order_1" "Family_1" "Genus_1" "t8" 
## t9  "Order_1" "Family_1" "Genus_2" "t9" 
## t10 "Order_1" "Family_1" "Genus_2" "t10"
## t11 "Order_1" "Family_1" "Genus_2" "t11"
## t12 "Order_1" "Family_1" "Genus_2" "t12"
## t13 "Order_1" "Family_1" "Genus_2" "t13"
## t14 "Order_1" "Family_1" "Genus_2" "t14"
## t15 "Order_1" "Family_1" "Genus_2" "t15"
## t16 "Order_1" "Family_1" "Genus_2" "t16"
## t17 "Order_1" "Family_1" "Genus_3" "t17"
## t18 "Order_1" "Family_1" "Genus_3" "t18"
## t19 "Order_1" "Family_1" "Genus_3" "t19"
## t20 "Order_1" "Family_1" "Genus_3" "t20"
## t21 "Order_1" "Family_1" "Genus_3" "t21"
## t22 "Order_1" "Family_1" "Genus_3" "t22"
## t23 "Order_1" "Family_1" "Genus_3" "t23"
## t24 "Order_1" "Family_1" "Genus_3" "t24"
## t25 "Order_1" "Family_1" "Genus_4" "t25"
## t26 "Order_1" "Family_1" "Genus_4" "t26"
## t27 "Order_1" "Family_1" "Genus_4" "t27"
## t28 "Order_1" "Family_1" "Genus_4" "t28"
## t29 "Order_1" "Family_1" "Genus_4" "t29"
## t30 "Order_1" "Family_1" "Genus_4" "t30"
## t31 "Order_1" "Family_1" "Genus_4" "t31"
## t32 "Order_1" "Family_1" "Genus_4" "t32"
## t33 "Order_1" "Family_2" "Genus_5" "t33"
## t34 "Order_1" "Family_2" "Genus_5" "t34"
## t35 "Order_1" "Family_2" "Genus_5" "t35"
## t36 "Order_1" "Family_2" "Genus_5" "t36"
## t37 "Order_1" "Family_2" "Genus_5" "t37"
## t38 "Order_1" "Family_2" "Genus_5" "t38"
## t39 "Order_1" "Family_2" "Genus_5" "t39"
## t40 "Order_1" "Family_2" "Genus_5" "t40"
## t41 "Order_1" "Family_2" "Genus_6" "t41"
## t42 "Order_1" "Family_2" "Genus_6" "t42"
## t43 "Order_1" "Family_2" "Genus_6" "t43"
## t44 "Order_1" "Family_2" "Genus_6" "t44"
## t45 "Order_1" "Family_2" "Genus_6" "t45"
## t46 "Order_1" "Family_2" "Genus_6" "t46"
## t47 "Order_1" "Family_2" "Genus_6" "t47"
## t48 "Order_1" "Family_2" "Genus_6" "t48"
## t49 "Order_1" "Family_2" "Genus_7" "t49"
## t50 "Order_1" "Family_2" "Genus_7" "t50"
## t51 "Order_1" "Family_2" "Genus_7" "t51"
## t52 "Order_1" "Family_2" "Genus_7" "t52"
## t53 "Order_1" "Family_2" "Genus_7" "t53"
## t54 "Order_1" "Family_2" "Genus_7" "t54"
## t55 "Order_1" "Family_2" "Genus_7" "t55"
## t56 "Order_1" "Family_2" "Genus_7" "t56"
## t57 "Order_1" "Family_2" "Genus_8" "t57"
## t58 "Order_1" "Family_2" "Genus_8" "t58"
## t59 "Order_1" "Family_2" "Genus_8" "t59"
## t60 "Order_1" "Family_2" "Genus_8" "t60"
## t61 "Order_1" "Family_2" "Genus_8" "t61"
## t62 "Order_1" "Family_2" "Genus_8" "t62"
## t63 "Order_1" "Family_2" "Genus_8" "t63"
## t64 "Order_1" "Family_2" "Genus_8" "t64"
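Since the original query asked for a data table, a matrix like this one can be given column names and converted to a data frame. This is only a sketch: the three-row matrix is a made-up stand-in for the full result, and the rank names (including calling the tip column "Species") are assumptions supplied by hand, not read from the tree.

```r
## small stand-in for the matrix printed above
taxa<-rbind(
    t1=c("Order_1","Family_1","Genus_1","t1"),
    t2=c("Order_1","Family_1","Genus_2","t2"),
    t3=c("Order_1","Family_2","Genus_3","t3"))
## name the columns by rank (assumed names, supplied by hand)
colnames(taxa)<-c("Order","Family","Genus","Species")
tab<-as.data.frame(taxa,stringsAsFactors=FALSE)
tab
## for example, list the tips belonging to each genus
split(rownames(tab),tab$Genus)
```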

In this case, that's all there is to it. If the number of labels differs among paths, then sapply can no longer simplify its result to a matrix, and we end up with something much messier: a list. For instance:

tree$node.label[122-Ntip(tree)]<-"Subgenus_1"
tree$node.label[125-Ntip(tree)]<-"Subgenus_2"
plotTree(tree,fsize=0.65,ftype="i",lwd=1,xlim=c(-0.06,1.1))
nodelabels(node=which(tree$node.label!="")+Ntip(tree),
    text=tree$node.label[which(tree$node.label!="")],cex=0.7)


## named taxonomy rather than T, which would mask R's shorthand for TRUE
taxonomy<-sapply(tree$tip.label,foo,tree=tree)
taxonomy
## $t1
## [1] "Order_1"  "Family_1" "Genus_1"  "t1"      
## 
## $t2
## [1] "Order_1"  "Family_1" "Genus_1"  "t2"      
## 
## $t3
## [1] "Order_1"  "Family_1" "Genus_1"  "t3"      
## 
## $t4
## [1] "Order_1"  "Family_1" "Genus_1"  "t4"      
## 
## $t5
## [1] "Order_1"  "Family_1" "Genus_1"  "t5"      
## 
## $t6
## [1] "Order_1"  "Family_1" "Genus_1"  "t6"      
## 
## $t7
## [1] "Order_1"  "Family_1" "Genus_1"  "t7"      
## 
## $t8
## [1] "Order_1"  "Family_1" "Genus_1"  "t8"      
## 
## $t9
## [1] "Order_1"  "Family_1" "Genus_2"  "t9"      
## 
## $t10
## [1] "Order_1"  "Family_1" "Genus_2"  "t10"     
## 
## $t11
## [1] "Order_1"  "Family_1" "Genus_2"  "t11"     
## 
## $t12
## [1] "Order_1"  "Family_1" "Genus_2"  "t12"     
## 
## $t13
## [1] "Order_1"  "Family_1" "Genus_2"  "t13"     
## 
## $t14
## [1] "Order_1"  "Family_1" "Genus_2"  "t14"     
## 
## $t15
## [1] "Order_1"  "Family_1" "Genus_2"  "t15"     
## 
## $t16
## [1] "Order_1"  "Family_1" "Genus_2"  "t16"     
## 
## $t17
## [1] "Order_1"  "Family_1" "Genus_3"  "t17"     
## 
## $t18
## [1] "Order_1"  "Family_1" "Genus_3"  "t18"     
## 
## $t19
## [1] "Order_1"  "Family_1" "Genus_3"  "t19"     
## 
## $t20
## [1] "Order_1"  "Family_1" "Genus_3"  "t20"     
## 
## $t21
## [1] "Order_1"  "Family_1" "Genus_3"  "t21"     
## 
## $t22
## [1] "Order_1"  "Family_1" "Genus_3"  "t22"     
## 
## $t23
## [1] "Order_1"  "Family_1" "Genus_3"  "t23"     
## 
## $t24
## [1] "Order_1"  "Family_1" "Genus_3"  "t24"     
## 
## $t25
## [1] "Order_1"  "Family_1" "Genus_4"  "t25"     
## 
## $t26
## [1] "Order_1"  "Family_1" "Genus_4"  "t26"     
## 
## $t27
## [1] "Order_1"  "Family_1" "Genus_4"  "t27"     
## 
## $t28
## [1] "Order_1"  "Family_1" "Genus_4"  "t28"     
## 
## $t29
## [1] "Order_1"  "Family_1" "Genus_4"  "t29"     
## 
## $t30
## [1] "Order_1"  "Family_1" "Genus_4"  "t30"     
## 
## $t31
## [1] "Order_1"  "Family_1" "Genus_4"  "t31"     
## 
## $t32
## [1] "Order_1"  "Family_1" "Genus_4"  "t32"     
## 
## $t33
## [1] "Order_1"  "Family_2" "Genus_5"  "t33"     
## 
## $t34
## [1] "Order_1"  "Family_2" "Genus_5"  "t34"     
## 
## $t35
## [1] "Order_1"  "Family_2" "Genus_5"  "t35"     
## 
## $t36
## [1] "Order_1"  "Family_2" "Genus_5"  "t36"     
## 
## $t37
## [1] "Order_1"  "Family_2" "Genus_5"  "t37"     
## 
## $t38
## [1] "Order_1"  "Family_2" "Genus_5"  "t38"     
## 
## $t39
## [1] "Order_1"  "Family_2" "Genus_5"  "t39"     
## 
## $t40
## [1] "Order_1"  "Family_2" "Genus_5"  "t40"     
## 
## $t41
## [1] "Order_1"  "Family_2" "Genus_6"  "t41"     
## 
## $t42
## [1] "Order_1"  "Family_2" "Genus_6"  "t42"     
## 
## $t43
## [1] "Order_1"  "Family_2" "Genus_6"  "t43"     
## 
## $t44
## [1] "Order_1"  "Family_2" "Genus_6"  "t44"     
## 
## $t45
## [1] "Order_1"  "Family_2" "Genus_6"  "t45"     
## 
## $t46
## [1] "Order_1"  "Family_2" "Genus_6"  "t46"     
## 
## $t47
## [1] "Order_1"  "Family_2" "Genus_6"  "t47"     
## 
## $t48
## [1] "Order_1"  "Family_2" "Genus_6"  "t48"     
## 
## $t49
## [1] "Order_1"  "Family_2" "Genus_7"  "t49"     
## 
## $t50
## [1] "Order_1"  "Family_2" "Genus_7"  "t50"     
## 
## $t51
## [1] "Order_1"  "Family_2" "Genus_7"  "t51"     
## 
## $t52
## [1] "Order_1"  "Family_2" "Genus_7"  "t52"     
## 
## $t53
## [1] "Order_1"  "Family_2" "Genus_7"  "t53"     
## 
## $t54
## [1] "Order_1"  "Family_2" "Genus_7"  "t54"     
## 
## $t55
## [1] "Order_1"  "Family_2" "Genus_7"  "t55"     
## 
## $t56
## [1] "Order_1"  "Family_2" "Genus_7"  "t56"     
## 
## $t57
## [1] "Order_1"    "Family_2"   "Genus_8"    "Subgenus_1" "t57"       
## 
## $t58
## [1] "Order_1"    "Family_2"   "Genus_8"    "Subgenus_1" "t58"       
## 
## $t59
## [1] "Order_1"    "Family_2"   "Genus_8"    "Subgenus_1" "t59"       
## 
## $t60
## [1] "Order_1"    "Family_2"   "Genus_8"    "Subgenus_1" "t60"       
## 
## $t61
## [1] "Order_1"    "Family_2"   "Genus_8"    "Subgenus_2" "t61"       
## 
## $t62
## [1] "Order_1"    "Family_2"   "Genus_8"    "Subgenus_2" "t62"       
## 
## $t63
## [1] "Order_1"    "Family_2"   "Genus_8"    "Subgenus_2" "t63"       
## 
## $t64
## [1] "Order_1"    "Family_2"   "Genus_8"    "Subgenus_2" "t64"

Perhaps we could still post-process this - but it would obviously be more difficult. Note that there is no difficulty if genera are at different temporal depths in the tree - so long as there is the same number of named levels between the root & any tip.
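One possible way to post-process a ragged result is sketched below. It rests on a strong assumption that holds in this demo but not in general: every node label is written as "Rank_something", so the rank can be read from the text before the first underscore. The function raggedToTable and the "Tip" column name are hypothetical, and the list x is a small stand-in for the full list.

```r
## hypothetical helper: turn a list of per-tip label vectors into a
## data frame with one column per rank, with NA where a rank is missing.
## ASSUMPTION: every node label is written "Rank_something".
raggedToTable<-function(x,ranks){
    out<-matrix(NA_character_,length(x),length(ranks)+1,
        dimnames=list(names(x),c(ranks,"Tip")))
    for(i in seq_along(x)){
        labs<-x[[i]]
        tip<-labs[length(labs)] ## last element is the tip label
        labs<-labs[-length(labs)]
        r<-sub("_.*$","",labs) ## rank prefix of each label
        keep<-r%in%ranks ## ignore labels with unknown ranks
        out[i,r[keep]]<-labs[keep]
        out[i,"Tip"]<-tip
    }
    as.data.frame(out,stringsAsFactors=FALSE)
}

## stand-in for a ragged list like the one above
x<-list(
    t56=c("Order_1","Family_2","Genus_7","t56"),
    t57=c("Order_1","Family_2","Genus_8","Subgenus_1","t57"),
    t61=c("Order_1","Family_2","Genus_8","Subgenus_2","t61"))
tab<-raggedToTable(x,ranks=c("Order","Family","Genus","Subgenus"))
tab
```

Tips with no labeled subgenus simply get NA in that column. With real labels that do not encode their rank, the rank of each node would have to be supplied some other way.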

That's all.
