Sunday, November 3, 2013

New function to add species to a genus in a phylogeny

Recently an R-sig-phylo subscriber made the following request:

"I need to add a list of 230 species in a phylogenetic tree. Is there a logical way to add the species to the root of the genera to which it belongs, in a systematic way, that is, to make a function recognizes the name of the genera to which the species belongs, and so it adds the species to that root?"

I posted a solution to this, but I also thought that other phytools users might face the same problem so I have now automated this in the function add.species.to.genus. This, along with the matrix comparison function skewers, is also in a new minor phytools build (phytools 0.3-73).

The function works by first peeling the genus name out of the species name. It does this by looking for either the underscore character, "_", or a space character, " ". It then proceeds to identify the clade containing con-generics in the tree by matching the genus name to the tip labels. Finally, it can either attach the new species to the root node of the most inclusive clade containing congenerics; or it can attach the new species randomly within that clade. In general, the function works best if the input tree is ultrametric. Otherwise, it may return a tree without edge lengths!

Here's a quick demo:

> ## load phytools
> library(phytools)
Loading required package: ape
Loading required package: maps
Loading required package: rgl
> packageVersion("phytools")
[1] ‘0.3.73’
> ## here's our starting tree
> plotTree(tree,ftype="i")
> ## add a species to 'Genus2' at root of genus
> species<-"Genus2_sp3"
> t1<-add.species.to.genus(tree,species)
> plotTree(t1,ftype="i")
> ## add a species to 'Genus4' randomly in genus
> species<-"Genus4 sp7"
> t2<-add.species.to.genus(tree,species,where="random")
> plotTree(t2,ftype="i")
> ## add a set of species in a vector
> species<-c("Genus1_sp2", "Genus2_sp3", "Genus3_sp2", "Genus4_sp7", "Genus4_sp8", "Genus5_sp3")
> for(i in 1:length(species)) tree<-add.species.to.genus(tree,species[i],where="random")
> plotTree(tree,ftype="i")

If the user supplies a species name for a genus with only one representative in the tree; a species with no congeners in the tree; or a genus that is non-monophyletic, in each case the function will try to do something rational and return a warning.

That's it.

12 comments:

  1. Hi Liam,
    I am trying to use the add.species.to.genus function, but I am having trouble getting the correct result. The function is only adding the last species of a vector to the phylogeny. Here is my code below and any comment(s) that you can give would be great.
    Thanks,
    Chris

    ## (((Frankliniella_tritici:1.0,Frankliniella_occidentalis:1.0):1.0,Echinothrips_americanus:2.0):1.0,Thrips_tabaci:3.0);##
    chronotree<-read.tree("test_spp_tre.tree")#load tree
    species<-c("Frankliniella_A","Thrips_A","Echinothrips_A")#spp to add to chrono
    for(i in 1:length(species)) tree5<-add.species.to.genus(chronotree,species[i],where="root")#add spp to chrono

    ReplyDelete
    Replies
    1. Hi Chris. This is because your for loop is written incorrectly. It should be:

      for(i in 1:length(species)) chronotree<-add.species.to.genus(chronotree,species[i],
      where="root")
      plotTree(chronotree)

      Otherwise you will not be updating the tree every loop.

      - Liam

      Delete
  2. Thanks Liam, I appreciate your help!

    ReplyDelete
  3. In case other people find it useful, here's a function for adding genus names to the node uniquely containing all members of that genus:

    for(genus in unique(sub("_.*", "",tree$tip.label))) {
    genusRE <- paste("^", genus, "_", sep="")
    species<-grep(genusRE,tree$tip.label)
    if (length(species)>=2) {
    mrca<-findMRCA(tree,tree$tip.label[species])
    desc <- getDescendants(tree, mrca)
    desc_tips <- desc[desc <= length(tree$tip.label)]
    if (length(grep(genusRE, tree$tip.label[desc_tips], invert = TRUE))==0) {
    base_node <- mrca - length(tree$tip.label)
    if (is.null(tree$node.label[base_node]) || is.na(tree$node.label[base_node])) {
    tree$node.label[base_node] = genus
    }
    }
    }
    }

    ReplyDelete
  4. Great function!

    Would it be possible to - instead of inserting taxa inside the clade, i.e. above the empirical MCRA - could one define a maximum age above which new taxa could be inserted?

    Reason I ask is, if you have a genus that is grossly under-sampled and the few species you have are closely related, the MRCA will be quite young and all inserted taxa will be crammed into this short time-frame, which might be unrealistic.

    Thanks!

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. Hi Liam,
    I am working with a Large phylo object that I retrieved from OpenToL. After some edition deleting the "Open Tree of Life Taxonomy identifiers" (ott ids) to leave the species names only, I read the tree in R using "read.newick()" with no problems. Then I tried the add.species.to.genus() with my database, but in doing so I get the following error message:

    Error in read.tree(text = write.tree(tree)) :
    The tree has apparently singleton node(s): cannot read tree file.
    Reading Newick file aborted at tree no. 1
    In addition: There were 50 or more warnings (use warnings() to see the first 50)

    If I understand well "add.species.to.genus" uses read.tree() to read the tree instead of the read.newick() function, hence singletons in my original tree become a problem at this point. Am I right? If so, what should I do? Should I edit my tree to becorrectly readed by read.tree() first? or there is a way to modify arguments in add.species.to.genus() to use read.newick() instead?

    Btw: excelent blog, thanks!

    ReplyDelete
    Replies
    1. Hmmm. You could try:

      read.tree<-read.newick
      ## run your code
      rm(read.tree)

      Not sure, but that might work.

      Delete
  7. Hi Liam -

    I've been using this function quite a bit lately, but have run into an issue with my latest dataset. I'm iteratively adding about 20 species from different genera into an ultrametric tree, and for some reason it keeps spitting out a non-ultrametric tree about halfway through the loop, which then leads to a tree with no branch lengths at all in the next iteration. I tried to track it down, and it seems that in some instances, the internal splitTree function returns a subtree with branch lengths that differ just above the default tolerance for 'is.ultrametric', which is also used internally. Thus it thinks the whole tree is non-ultrametric, and adds the new tip without the proper consideration. Do you have any ideas on how to combat this behavior?? It could be simple enough just to specify a slightly larger tolerance on the internal use of is.ultrametric, but I'm not sure how to do that...

    Thanks in advance!!

    ReplyDelete
    Replies
    1. This comment has been removed by the author.

      Delete
  8. Hi Ryan and Liam,

    I am having the same problem (If I understood your). I am adding some species in an a tree (that I transformed in ultrametric before), but in some moment the function returns a error message that the tree is not ultrametric.I don't know how I can solve this.

    ReplyDelete