Monday, October 29, 2012

Multiple matrix regression with Mantel permutations

An R user asks:
"whether there is in the world a script (function) in R for multi_mantel: multiple matrix regression and hypothesis testing"

The program for multiple matrix regression that the user is referring to is my old C program, "multi_mantel", that I have not used in quite some time (and try my hardest not to support).

When I programmed this years ago (i.e., circa 2006) it was a considerable undertaking. In R, it should be a much easier task. This is because we can take advantage of existing 'base' and 'stats' package functions such as lm and summary.lm (not to mention all the other many conveniences that are afforded by the R environment).

To prove this (on a rainy, windy day here in Boston) I just did so. A link to the function code is here.

The way this works is really straightforward. The function first unfolds our input matrices (which can also be objects of class "dist", and then performs multivariable regression using lm. (Note that if we have multiple independent matrices, we should give them in a list.) Next, it randomizes rows & columns together in the dependent matrix, following Mantel (1967), each time recomputing the test statistics (in this case, F and the t-statistics for each coefficient in the model.) Finally, it compares the observed F & t-values to the null distributions obtained by permutation.

The function internally calls two other custom functions (unfoldLower & foldtoLower) that convert a symmetric distance matrix to a vector; or, conversely, that converts a vector to a symmetric distance matrix.

Here's an example use. Note that I don't endorse the Mantel test for estimating phylogenetic signal. According to Harmon & Glor (2010) this is generally a bad idea.

> source("multi.mantel.R")
> tree<-pbtree(n=100)
> y<-fastBM(tree,sig2=2)
> Y<-dist(y)
> X<-cophenetic(tree)
> fit<-multi.mantel(Y^2,X,nperm=1000)
> fit[1:6]
$r.squared
[1] 0.0341906
$coefficients
(intercept)          X1
  1.788184    2.207660
$tstatistic
(intercept)          X1
  1.600743   13.234956
$fstatistic
[1] 175.1641
$probt
(intercept)          X1
     1.000       0.001
$probF
[1] 0.001

Note that the regression slope (fit$coefficients["X1"] is also an estimate of the evolutionary rate in this case - although it is not a very good one. Compare the fit & P-value here to the case in which we have data with no phylogenetic signal:

> require(geiger)
> y<-fastBM(lambdaTree(tree,0))
> Y<-dist(y)
> fit<-multi.mantel(Y^2,X,nperm=1000)
> fit[1:6]
$r.squared
[1] 0.0002468181
$coefficients
(intercept)          X1
6.50077758  0.09146137
$tstatistic
(intercept)          X1
 11.730152    1.105241
$fstatistic
[1] 1.221558
$probt
(intercept)          X1
     0.666       0.680
$probF
[1] 0.68

Again, I wouldn't necessarily advocate measuring phylogenetic signal this way (if at all), but this example still serves to demonstrate the function nicely.

Finding the MRCA for two or multiple species across a set of trees

A user recently* contacted me with a question about finding the MRCA for two or multiple species across a set of trees in an R "multiPhylo" object. (*Unfortunately, "recently" is running about two weeks behind, so this email was actually sent on 10/15. To that user, I say "sorry!") She nearly had it correct, but there was a small error in how she was using sapply. Here is the correct way, by way of demonstration:

> require(phytools)
Loading required package: phytools
Loading required package: ape
...
> # simulation some random trees
> trees<-rmtree(N=100,n=10)
> sp<-c("t1","t2") # for example
> # now find the MRCA of sp across the set of trees
> ca<-sapply(trees,function(x,sp) findMRCA(x,sp),sp=sp)
> ca
  [1] 11 19 12 12 12 14 15 14 16 13 12 12 11 16 19 14 17 11
[19] 11 11 11 11 11 11 13 18 19 11 15 12 11 19 12 12 11 12
[37] 11 12 13 11 11 12 12 11 18 12 11 11 15 16 11 16 12 12
[55] 11 14 11 11 15 17 11 14 16 19 11 13 18 11 16 13 12 11
[73] 12 14 13 11 13 11 11 14 17 11 12 11 11 11 11 12 11 13
[91] 11 19 19 11 13 12 12 11 11 11
> # now verify (for the first 3 trees)
> par(mfrow=c(3,1),mar=c(1.5,1.5,1.5,1.5))
> plot.phylo(trees[[1]],cex=1.5); nodelabels(cex=1.5)
> title("tree 1",cex.main=2)
> plot.phylo(trees[[2]],cex=1.5); nodelabels(cex=1.5)
> title("tree 2",cex.main=2)
> plot.phylo(trees[[3]],cex=1.5); nodelabels(cex=1.5)
> title("tree 3",cex.main=2)


Cool. It works.

Thursday, October 25, 2012

Version of ancThresh posted

I just posted a version of my function for ancestral character estimation under the threshold model (e.g., 1, 2, 3). Direct link to the code is here. I'm sharing this because there has been some interest since I started posting about the method, but I should note that this is a brand new function & method, and the code is "bleeding edge," so should be used only with caution. (I'm also working on a manuscript describing the approach.)

Tuesday, October 23, 2012

Some more updates to xkcdTree

After discovering an actual xkcd comic featuring a phylogenetic tree, I was determined to incorporate some of the attributes of this comic tree into the function xkcdTree. Specifically, the real xkcd trees had tilted tip labels, a root edge, and an upward orientation (all things that my xkcd trees lacked). All of these features are also popular plotting options which, now that I've figured them out, should be easier to incorporate into other phytools plotting function - such as plotSimmap - so it's not all about plotting cartoon trees!

Tilted labels are easy, we can just use text(...,srt=tilt). However, they do create the problem that, now that the labels are titled, we need a different amount of space on the x & y axes to fit the labels. We can compute these (unsurprisingly) using strwidth and our 6th-grade trig identities, for instance:
dx<-abs(max(strwidth(cw$tip.label,units="figure",cex=fsize, family="xkcd"))*cos(deg2rad(tilt)))
dy<-abs(max(strwidth(cw$tip.label,units="figure",cex=fsize, family="xkcd"))*sin(deg2rad(tilt)))

deg2rad is an internal function for xkcdTree.

To plot a vertical (i.e., upward facing) tree, all we need to do is flip x & y. We can't forget, though, that the space we have calculated for our labels will also need to be transformed - depending on the dimensions we've allocated for our plotting space.

The new version of the function is here, and it will be in the next release of phytools. Let's try it out:
> require(phytools)
Loading required package: phytools
Loading required package: ape
...
> require("extrafont")
Loading required package: extrafont
> tree<-read.tree(text="((mammals,(birds,reptiles)),amphibians);")
> for(i in 1:tree$Nnode+length(tree$tip)) tree<-rotate(tree,node=i)
> tree<-compute.brlen(tree)
> source("xkcdTree.R")
> tree$root.edge<-0.25
> xkcdTree(tree,file="herpetology.pdf",lwd=2,color="black", dim=c(4,4),jitter=0.001,waver=c(0.03,0.03),tilt=-40, right=F)
**** NOTE: use in Windows requires ...
>


Obviously, plotting cartoon trees is a fairly silly exercise - but I have certainly learned a thing or too that will help me as I augment that plotting functions of the phytools package in the future.

Sunday, October 21, 2012

Actual xkcd tree

In all my posts about the function for plotting xkcd style trees (e.g., 1, 2, 3, 4, 5, 6), I somehow overlooked the fact that there is an xkcd comic featuring a phylogenetic tree (more specifically, two trees - although the topological correctedness of the second is debatable). Furthermore, it is one of my favorite xkcd comics of all time:



Who knew that I should have been using this as a guide all along!

In the end, though (and with some tweaking of the settings from their default values, and a little post-hoc editing), my function does a pretty good job:

> require(phytools)
Loading required package: phytools
Loading required package: ape
...
> require("extrafont")
Loading required package: extrafont
> tree<-read.tree(text="((mammals,(birds,reptiles)),amphibians);")
> tree<-compute.brlen(tree)
> xkcdTree(tree,file="herpetology.pdf",lwd=2,color="black", dim=c(4,4),jitter=0.001,waver=c(0.03,0.03))
**** NOTE: use in Windows requires ...
>
That's it!

Tuesday, October 16, 2012

Estimating Pagel's λ for a set of traits in a matrix or data frame

Earlier, I posted about estimating Pagel's λ jointly for a set of traits. A phytools user recently submitted the following comment:

One question I had was how to calculate lambda on multiple traits? I have code that gets me close in that it outputs the lambda values but not the log values or p-value.

This could be done, of course, using a for loop, something like:
> require(phytools); require(geiger)
> # simulate tree & data
> tree<-pbtree(n=100)
> X<-fastBM(lambdaTree(tree,0.7),nsim=10)
> lambda<-matrix(NA,ncol(X),4,dimnames=list(NULL, c("lambda","logL","logL0","P")))
> if(is.data.frame(X)) X<-as.matrix(X)
> for(i in 1:nrow(lambda))
lambda[i,]<-unlist(phylosig(tree,X[,i],method="lambda", test=T))
> lambda
        lambda      logL     logL0            P
[1,] 0.8176128 -176.9442 -199.7361 1.462441e-11
[2,] 0.5321105 -183.5490 -192.2505 3.023510e-05
[3,] 0.7868250 -177.3286 -200.6997 8.096301e-12
[4,] 0.6772734 -182.8100 -198.7038 1.719778e-08
[5,] 0.4893630 -183.6516 -193.9559 5.634189e-06
[6,] 0.7477579 -167.3094 -194.6396 1.432188e-13
[7,] 0.7160706 -182.1853 -207.2614 1.422751e-12
[8,] 0.6029530 -184.5244 -200.5472 1.505981e-08
[9,] 0.6654206 -181.8013 -201.3363 4.088944e-10
[10,] 0.7553931 -182.0996 -200.9483 8.261227e-10

Alternatively, however, we could also do the following using sapply:
> X<-X[tree$tip.label,]
> X<-as.data.frame(X)
> lambda<-t(sapply(X,phylosig,tree=tree,method="lambda", test=T))
[1] "x has no names; assuming x is in the same order as tree$tip.label"
...
> lambda
   lambda    logL      logL0     P          
V1  0.8176128 -176.9442 -199.7361 1.462441e-11
V2  0.5321105 -183.549  -192.2505 3.02351e-05
V3  0.786825  -177.3286 -200.6997 8.096301e-12
V4  0.6772734 -182.81   -198.7038 1.719778e-08
V5  0.489363  -183.6516 -193.9559 5.634189e-06
V6  0.7477579 -167.3094 -194.6396 1.432188e-13
V7  0.7160706 -182.1853 -207.2614 1.422751e-12
V8  0.602953  -184.5244 -200.5472 1.505981e-08
V9  0.6654206 -181.8013 -201.3363 4.088944e-10
V10 0.7553931 -182.0996 -200.9483 8.261227e-10

That's it.

Monday, October 15, 2012

Posterior density for the thresholds during ancestral character estimation using the threshold model

I've been doing more exploration of ancestral character estimation using the threshold model. One thing that I was interested in was how good we can be at estimating the relative positions of the thresholds along our liability axis. Note that since unobserved liabilities have arbitrary scale, only the relative positions of the thresholds have meaning. Specifically, the set of thresholds [1,2,4] with Brownian rate of evolution of the liability of σ2=1.0 has the exact same meaning & interpretation of thresholds [0,2,6] and σ2=2.0.

Nonetheless, since I have simulated liability evolution with a fixed, known σ2 (and it is the same σ2 that I will use for estimation), it makes sense in this rare case to look at the posterior density for the thresholds on its raw scale. I decided to simulate using a four state (thus 3 threshold) model, in which the four states are "blue", "green", "red", and "yellow"; and the threshold transition points between colors are 0 (blue to green), 1 (green to red) and 4 (red to yellow).

Here was my simulation:
> # load source
> source("ancThresh.R")
> # simulate tree
> tree<-pbtree(n=100,scale=10)
> # simulate liability
> l<-fastBM(tree,internal=TRUE,a=2)
> # translate to discrete character
> x<-rep("blue",length(l))
> x[l>0]<-"green"
> x[l>1]<-"red"
> x[l>4]<-"yellow"
> names(x)<-names(l)
> sequence<-c("blue","green","red","yellow")
> cols<-sequence; names(cols)<-sequence
> res<-ancThresh(tree,x[1:length(tree$tip)],seq=sequence, ngen=1000000,control=list(piecol=cols))
MCMC starting....
gen 1000
gen 2000
...

The posterior sample for the thresholds is in res$par. Unlike in the translation, above, each threshold is named for its upper bound (so, blue = 0; green = 1; red = 4; and yellow = Inf, in this case). Obviously, and again because liabilities are unscaled and uncentered, we fix the lowest of these at a constant value - we choose 0, but this is arbitrary.

Now let's do some computations to plot the posterior density:
> breaks<--0.25:25/5
> breaks
[1] -0.05  0.15  0.35  0.55  0.75  0.95  ...
> green<-hist(res$par[,"green"],breaks=breaks)
> red<-hist(res$par[,"red"],breaks=breaks)
> plot(red$mids,red$density,type="s",ylim=c(0, max(c(red$density,green$density))),xlab="liability", ylab="density",col="red")
> lines(red$mids,green$density,type="s",col="green")

I have also added the generating threshold (colored on each side with the appropriate discrete state) for fun. Not too bad.
It should also be noted that if we look at the ratio of the mean estimates of the two thresholds this is also quite close to the generating ratio of four:
> mean(res$par[,"red"])/mean(res$par[,"green"])
[1] 4.3913

That's it.

Sunday, October 14, 2012

Ancestral character estimation under the threshold model, part II

Another quick post about ancestral character estimation under the threshold model from quantitative genetics. For a reminder of the details of this model, please check out my prior posts on the topic, as well as Felsenstein (2012).

The central idea is that this model, in which the value for a discrete character is determined by the evolution of an underlying continuous trait called 'liability,' might be a more appropriate model for many discretely manifested, by complex or polygenic, attributes of organismal phenotype.

I'm estimating under this model by using Bayesian MCMC to sample ancestral and tip liabilities, and the thresholds that result in a change in the discrete character, from their joint posterior probability distribution.

I'm still working on this, but here is an example simulation, analysis, and then some assessments of performance:

> source("ancThresh.R")
> # simulate liabilities
> l<-fastBM(tree,internal=TRUE,sig2=2,a=-1)
> # translate liabilities to threshold character
> x<-rep("blue",length(l))
> x[l>0]<-"green"
> x[l>1]<-"red"
> x[l>6]<-"yellow" > names(x)<-names(l)
> x
    t54      t55      t51 ...
 "blue"   "blue"   "blue" ...
> x11(); res<-ancThresh(tree,x[1:length(tree$tip)],ngen=100000)
**** NOTE: no sequence provided, using alphabetical or numerical order
MCMC starting....
gen 1000
gen 2000
...
> # plot the likelihood profile
> plot(res$par[,"gen"],res$par[,"logLik"],xlab="gen", ylab="logL")
> lines(res$par[,"gen"],res$par[,"logLik"])
> # pull estimates as max PP for each node
> est<-apply(res$ace,1,function(x) names(x)[which(x==max(x))])
> est
    101      102      103 ...
  "red"   "blue"   "blue" ...
> # do they match?
> a<-x[101:199] # get simulated ancestors
> sum(a==est)/length(a)
[1] 0.7878788 # about 79% match
> # what if we restrict to >0.5, 0.7, or 0.9 PP?
> any50<-apply(res$ace,1,function(x) any(x>0.5))
> sum(any50)/tree$Nnode # gives us the % nodes with PP>0.5
[1] 0.969697
> # now the fraction that match, given PP>0.5
> sum(a[any50]==est[any50])/sum(any50)
[1] 0.7916667
> # PP > 0.7
> any70<-apply(res$ace,1,function(x) any(x>0.7))
> sum(any70)/tree$Nnode
[1] 0.5050505
> sum(a[any70]==est[any70])/sum(any70)
[1] 0.96
> # 96% of nodes with PP>0.7 are correct
> # PP > 0.9
> any90<-apply(res$ace,1,function(x) any(x>0.9))
> sum(any90)/tree$Nnode
[1] 0.2929293
> sum(a[any90]==est[any90])/sum(any90)
[1] 1

Cool. I will post the code & more on this very soon.

Friday, October 12, 2012

ProjectEvoMap

I just added the Revell lab at UMass-Boston to ProjectEvoMap. ProjectEvoMap is, in the words of its creator, "a resource where evolutionary biologists can find info on labs and groups from all around the world." Here is a screenshot from the Revell lab entry:



Evidently, we are the first lab added from the greater Boston area. (Many more have already been posted from around the country and world.) Click on the image to go to ProjectEvoMap (centered on us, naturally).

Ed. note (Oct. 20, 2012): ProjectEvoMap has moved to http://projectevomap.yolasite.com. Check it out.

Thursday, October 11, 2012

Ancestral character estimation under the threshold model

Since posting about it on the anoleannals.org blog, I have been working on a first pass at ancestral character estimation under the threshold model.

Under the threshold model, the phenotypic trait of interest has a discretely valued presentation - for instance, color or habitat*; but a continuous, unmeasurable, underlying liability. (*Note, of course, that a characteristic like habitat might be judged by measuring a number of quantitative features about the environment - this is not a serious problem.) When the liability exceeds some threshold value, the discretely valued state of the observable character trait changes. So, for instance, the figure below (which I already used in a previous post) shows evolution of liability on a phylogenetic tree. When the liability crosses a threshold, the observable trait (here shown as a set of colors) changes state.
For more details on the threshold model, which derives originally from quantitative genetics, I would encourage you to check out Felsenstein (2012) and/or my prior posts on the subject (1, 2, 3).

The model provides several advantages. One is that it may be inherently more realistic than the discrete Markov model that we typically use for discretely-valued character traits. This model is the same as the "nucleotide model" that we use for DNA sequence evolution in phylogeny inference, and thus may not be well suited to complex traits typical of organismal phenotypes.

OK, so in principle, applying the threshold model to ancestral character estimation shouldn't be that different from applying it other problems. If we are to use Bayesian MCMC (as I have done in this case), we should just sample the liabilities for the tips of the tree and internal nodes; compute their probability under our model for evolution of the liabilities; and then multiply this probability by 1.0 if the liabilities predict our observed phenotypes, and 0.0 otherwise. Since it is not too interesting to reconstruct a trait that only has a binary presentation, I have also added MCMC sampling of the "positions" of the thresholds. (Our liabilities are unobserved, and thus scaleless, so these positions can only be interpreted in relative terms.)

Unfortunately, I have to run - so I will post more about this later, but below I've plotted the results from a single MCMC run of this method. The colored circles at the tips are the tip states for the discrete character, and the pie graphs at internal nodes represent the posterior probability of the internal nodes between in each of the four observed states, under the threshold model.

> source("ancThresh.R")
> tree<-pbtree(n=100)
> l<-fastBM(tree,internal=TRUE) # simulate liability
> # discretize!
> x<-rep("blue",length(l))
> x[l>0]<-"green"
> x[l>1]<-"red"
> x[l>2]<-"yellow"
> names(x)<-names(l)
> # run the MCMC
> result<-ancThresh(tree,x[1:length(tree$tip)],ngen=100000, control=list(sample=1000))
**** NOTE: no sequence provided, using alphabetical or numerical order
[1] "gen 1000"
[1] "gen 2000"
...

Pretty cool! I will post a lot more on this very soon.

Wednesday, October 10, 2012

New test version of phytools (v0.2-02) with xkcdTree and new version of pgls.Ives

I just posted a new non-CRAN version of phytools (version 0.2-02). To install, just download phytools_0.2-02.tar.gz and install from source:

> install.packages("phytools_0.2-02.tar.gz",type="source", repos=NULL)
...
* installing *source* package 'phytools' ...
...
* DONE (phytools)
> require(phytools)
Loading required package: phytools
Loading required package: ape
...

With respect to phytools v0.2-01 this version has two additions:

1) An update to pgls.Ives. pgls.Ives implements the method of Ives et al. (2007) for phylogenetic regression with error in the estimation of species means. Basically, I was discovering that sometimes optimization failed due to the (arbitrarily specified) lower bounds required for σy2 and σx2 by the optimization method, optim(...,method="L-BFGS-B"). I have now decreased the default value for these bounds (by a factor of 10e-4) and also allow user control. Other aspects of control of optim are still not under user control, so I will change that. The function now also returns convergence and message from optim (basically optim's report on whether or not it thinks it has converged).

2) I have added the function xkcdTree to the phytools package. I documented xkcdTree in several prior posts (1, 2, 3, 4, 5) and I won't repeat that documentation here. Suffice it to say that the function plots a phylogenetic tree in the style of xkcd style graph (in other words, hand drawn with a particular all-caps style of font). I decided to develop this function on a whim and without any real intention of adding it to phytools, but it generated a huge amount of traffic to my blog (with many visitors coming from several a number of tweets that my posts on this function stimulated, see below).


Consequently, I have added the function to phytools with the following caveats:
a) phytools users wanting to plot xkcd style trees with phytools will first need to download and install the font 'xkcd.ttf' (just search for it);
b) users will also have to download and install the CRAN package extrafont and its dependencies - phytools will not load these dependencies of xkcdTree automatically; and
c) at least Windows users of R will have to download and install Ghostscript - this is required to embed the xkcd font into the PDF file with your tree.

Hopefully, xkcdTree will return a sensible error if you try to run it but have not taken one of the above three steps!

xkcdTree can be called on its own, or through fancyTree(...,type="xkcd").

With respect to the latest CRAN release of phytools, this new package version also has two more new functions: fastAnc and matchNodes (see 1 & 2), as well as updates to a number of other functions that now call fastAnc instead of the much slower (and sometimes quite inaccurate) anc.ML.

Thursday, October 4, 2012

Bug fix in xkcdTree; xkcd Caribbean anole phylogeny

I just realized that there was a bug in xkcdTree, my function for plotting xkcd style phylogenies. Basically, the argument lwd (line width) was not being transferred to the function for drawing xkcd style lines. Oops. Here is a link to a fixed version, which also allows control of the horizontal & vertical dimensions of the plotted tree. Also, for fun, an xkcd tree of 100 Greater Antillean anole species, from Mahler et al. (2010):
> library(phytools)
> source("xkcdTree.R")
> tree<-read.tree("anole.tre")
> xkcdTree(tree,"anoleTree.pdf",fsize=1.0,lwd=3, dim=c(5,13),waver=c(0.05,0.05),jitter=0.0005)

Wednesday, October 3, 2012

xkcd tree plot: part IV

I just posted my code for plotting xkcd style phylogenetic trees. The code uses tips from a stackoverflow.com discussion thread (initiated by my colleague Jarrett Byrnes), as well as a slightly modified version of an entire function posted by user295691.

The function is available on my phytools page (direct link to the code here).

**A few preliminary steps are required before the function will work!**

1. Install the package 'extrafont' and its dependencies.

2. Download and install the xkcd font, xkcd.ttf. The procedure for adding new fonts will depend on your operating system. For Windows 7, go Start → Control Panel → Appearance and Personalization → Fonts and then just drag & drop the .ttf file into this window.

3. At least in Windows, you will have to download and install Ghostscript. This allows you to embed the xkcd font into the PDF in which the tree is outputted.

The input arguments for xkcdTree are as follows: tree, your R "phylo" object; file, your filename for output (xkcdTree saves your plotted tree as a PDF); gsPath, the complete path to the Ghostscript binary (for a gs9.06 in Windows 7, the default should work); fsize, lwd, and color should be relatively self-explanatory; finally jitter is a scalar for the uncertainty to add (more is more wiggly), and waver is a vector with the horizontal and vertical relative 'wavelength' of the uncertainty (smaller numbers result in a shorter wavelength). Note that my previous example (here) has much more jitter than the default.

Let's try it:
> library(phytools)
Loading required package: ape
...
> source("xkcdTree.R")
> tree<-pbtree(n=40)
> xkcdTree(tree,"testTree.pdf",color="black",fsize=1.5)
Loading required package: extrafont
...
That's it!

Tuesday, October 2, 2012

xkcd tree plot: part III

There have been some complaints that my lines are too straight. How about this:

(With the caveat that there is also user control over the degree and frequency of waver if this is too scribbly.)

For some background on this blog post (for those stumbling on it in isolation) see part I and part II.

Now I should probably go do some "real" work. . . .

xkcd tree plot: part II

Ok, now I have added fonts to my xkcd style tree plotter, inspired by a stackoverflow.com question posted by my new UMass-Boston colleague Jarrett Byrnes. Here is what it looks like:


Pretty cool. OK, I will post the function & code I used to do this when I get it straightened up. (It involved a lot of fiddling - so I need to be sure I can retrace my steps!)

New Syst. Biol. paper out 'Advance Access'

My new article "A comment on the use of stochastic character maps to estimate evolutionary rate variation in a continuously valued trait" has just come out Advance Access in the journal Systematic Biology.

Advance Access, in this case, is just the final manuscript version of the article submitted by the authors and is not copyedited or typeset for the journal. (Consequently I noticed that all the figures of the article are repeated twice - I swear I'm not to blame for that!) Nonetheless, please check it out. Direct link to article is here.

Monday, October 1, 2012

xkcd tree plot: part I

Today my colleague Jarrett Byrnes posted a question to stackoverflow.com about creating xkcd style graphs (e.g., here). Basically, the idea is that the graph should look hand drawn (as well as somewhat cartoonish). I thought I'd take a stab at using some of the proffered solutions to try and create xkcd style phylogenetic trees. I'm no R plotting expert; consequently, so far I only have the topology and branches, no labels (and really, that's the most important part). Here is an example:

OK, not so great. This is just using base graphics - maybe I will try using ggplot2, if I can figure it out, as in most of the other stackoverflow.com solutions.

Fix for fastAnc and new function to match nodes

Ok, firstly, the function that I introduced last night for fast estimation the ML ancestral states for internal nodes on the tree (fastAnc) had a bug that would sometimes cause it to return one or multiple NAs if the tree had multifurcations. This was obviously due to some problem in my algorithm for matching nodes between trees. I have resolved this problem by creating a new, standalone utility function to match nodes between trees (called matchNodes. Here a matching pair of nodes is defined by sharing exactly all (i.e., no more & no less) of the same descendant leaves (regardless of intervening topology and branch lengths). The function looks as follows:

matchNodes<-function(tr1,tr2){
  desc.tr1<-lapply(1:tr1$Nnode+length(tr1$tip),
    function(x) extract.clade(tr1,x)$tip.label)
  names(desc.tr1)<-1:tr1$Nnode+length(tr1$tip)
  desc.tr2<-lapply(1:tr2$Nnode+length(tr2$tip),
    function(x) extract.clade(tr2,x)$tip.label)
  names(desc.tr2)<-1:tr2$Nnode+length(tr2$tip)
  Nodes<-matrix(NA,length(desc.tr1),2,dimnames= list(NULL,c("tr1","tr2")))
  for(i in 1:length(desc.tr1)){
    Nodes[i,1]<-as.numeric(names(desc.tr1)[i])
    for(j in 1:length(desc.tr2))
      if(all(desc.tr1[[i]]%in%desc.tr2[[j]]) && all(desc.tr2[[j]]%in%desc.tr1[[i]]))
        Nodes[i,2]<-as.numeric(names(desc.tr2)[j])
  }
  return(Nodes)
}

The top part of this code pulls out two lists of vectors containing the set of leaves descending from each node in each of the two input trees. The second part asks if - for each pair of nodes - all (and exactly all) of the descendant leaves are shared in common. The function returns a matrix containing the node numbers of tr1 in column 1, and their corresponding matches in column 2. Note, then, that the dimensions of the matrix are defined by the number of nodes in the first input tree. Mismatches in the other direction (i.e., nodes found in tree 2, but not in tree 1) won't be identified (but could be by another call to the function in which the argument order was flipped). Code for this function is here.

fastAnc is now highly simplified, as follows:

fastAnc<-function(tree,x){
  if(!is.binary.tree(tree)) btree<-multi2di(tree)
  else btree<-tree
  M<-btree$Nnode
  N<-length(btree$tip)
  anc<-vector()
  for(i in 1:M+N){
    anc[i-N]<-ace(x,multi2di(root(btree,node=i)),
      method="pic")$ace[1]
    names(anc)[i-N]<-i
  }
  if(!is.binary.tree(tree)){
    ancNames<-matchNodes(tree,btree)
    anc<-anc[as.character(ancNames[,2])]
    names(anc)<-ancNames[,1]
  }
  return(anc)
}

I have also made some additional changes to phenogram, phylomorphospace, and phylomorphospace3d, so that they now call fastAnc instead of ace or (worse) anc.ML. I have built this updates into a new non-CRAN version of phytools (version 0.2-01; I know, I just released a new CRAN update of phytools - sorry!) which can be downloaded here and installed from source:

> install.packages("phytools_0.2-01.tar.gz",type="source",
+ repos=NULL)
* installing *source* package 'phytools' ...
** R
...
* DONE (phytools)