Thursday, August 22, 2013

Converting a taxonomy to a phylogeny & the problem of branch lengths

A recent R-SIG-phylo subscriber asked:
"I'm trying to build a taxonomic tree from a taxonomic matrix using the function as.phylo.formula of the ape package. It works well but it returns a phylo object without branch length. This is problematic in that it can lead to misleading graphic representation (eg with plot.phylo). For example a dichotomy between two species belonging to the same genus may occur at the same y-coordinate than a dichotomy between two species of different genus."

This turns out to be the case because as.phylo.formula (in the ape package) first uses an algorithm that converts the input taxonomy into a Newick string - specifically without singleton nodes (i.e., nodes with only one descendant) for monogeneric families, monotypic genera, etc. It then reads the Newick string into memory using ape's read.tree to read this Newick string into memory. Having figured this out, addressing this problem in a modified version of as.phylo.formula (code at the end of my email, here - corrected here) was easy. First, I pulled the line of code that prevented the parsing function from creating singleton nodes in the Newick string. Second, I read the tree into memory with the phytools function read.newick, which permits singleton nodes. Third, I assigned unit branch lengths to all edges in this tree with singleton nodes, which guarantees that all internal nodes with the same taxonomic level (e.g., Family) have the same height above the root. Finally, I collapsed all singletons.

Here is a demo of the original function & the modified version:

> library(ape) ## load ape
> data(carnivora)
> t1<-as.phylo.formula(~SuperFamily/Family/Genus/Species, data=carnivora)
> par(lend=2)
> plot(t1,edge.width=2,cex=0.6,no.margin=TRUE)

Right away we can see the problem - nodes at the same taxonomic level are not necessarily at the same height above the root. There's two reasons for this: (1) some taxonomic levels have only one group in the level nested below it (for instance, families with only one genus; genera with only one species, etc.); and (2) compute.brlen, which is used internally by plot.phylo to determine the length of plotted edges with no edge lengths are given, uses the number of descendant leaves to determine the length of edges in the tree - and not all genera (and families, etc.) have the same number of descendant species.

Here is a demo of the modified version that I posted to the listserve:

> source("as.phylo.formula.R")
> t2<-as.phylo.formula(~SuperFamily/Family/Genus/Species, data=carnivora)
> plot(t2,edge.width=2,cex=0.6,no.margin=TRUE)

If we don't like how this looks then in theory we could use a different formula to set our branch lengths, but at least this plot has the property that all nodes at the same taxonomic level have the same height above the root. Cool.

2 comments:

  1. Hi Liam,

    Thanks for your explanation. Links to your example code are broken. Would you mind re-posting?

    Thanks in advance.

    ReplyDelete
  2. Hi Liam,

    Thanks for your explanation. Links to your example code are broken. Would you mind re-posting?

    Thanks in advance.

    ReplyDelete