Friday, June 29, 2012

R trick 1: get the frequencies of factors in a vector

Here's a quick R hint. (I had briefly forgotten how to do this, and the solution wasn't totally obvious online.) Say I have a vector of factors in memory in R and I want to get the frequency or relative frequency of the different levels of the factor, I can do this using the base generic function summary. Just to see how this works, consider a vector containing the best-fitting quantitative trait evolution model for a set of 100 trees:

> best.fit
 [1] BM     BM     OU     lambda BM     BM     OU     BM
 [9] BM     BM     BM     lambda lambda BM     OU     BM
[17] BM     BM     OU     BM     BM     BM     lambda BM
[25] BM     BM     BM     OU     lambda BM     BM     BM
[33] BM     OU     BM     BM     lambda lambda lambda BM
[41] BM     BM     BM     OU     BM     BM     BM     BM
[49] BM     OU     BM     BM     BM     BM     BM     BM
[57] lambda lambda OU     OU     BM     BM     lambda BM
[65] BM     BM     BM     BM     lambda BM     BM     BM
[73] BM     OU     OU     BM     lambda BM     lambda BM
[81] BM     lambda BM     BM     BM     OU     BM     BM
[89] BM     BM     OU     OU     lambda BM     BM     BM
[97] BM     BM     OU     BM
Levels: BM lambda OU


We can count up the number or relative frequency of trees with each best fit model as follows:

> summary(best.fit)
   BM lambda     OU
   68     16     16
> summary(best.fit)/sum(summary(best.fit))
   BM lambda     OU
 0.68   0.16   0.16


That's it.

Monday, June 18, 2012

In the field

I'm presently in the field in Puerto Rico through June 26th, so I may not be blogging much on phytools; however I did post a short field story to Anole Annals (see link here) and I may do that a few more times before I'm done. BTW, the picture below is of Anolis krugi (the "mountain" grass-bush anole in Puerto Rico, although we are only at about 200 feet above sea level), one of the species we will be studying on the island.

Sunday, June 10, 2012

New function to export data & trees in XML format for SIMMAP

Phytools has a function, make.simmap, that can be used to generate stochastic character mapped trees for discretely valued character data and phylogenies. This function, though, is really a much simplified version of Jonathan Bollback's stand alone program SIMMAP. The biggest difference between stochastic mapping in his program and with phytools::make.simmap is that my function first fits a model of character evolution using ML, and then samples from the conditional posterior distribution of character histories (conditioning on the fitted model), rather than the joint posterior distribution of substitution rates and character histories.

Some phytools users might therefore be interested in exporting their character data & trees for analysis in SIMMAP. In fact, I can attest to this because I want to do it - mainly to cross-validate a result that I obtained using make.simmap in phytools. Phytools can already be used to read in stochastic histories generated using SIMMAP (now both v1.0 and v1.5).

SIMMAP (v1.5, I'll get to v1.0 later) reads trees and data in XML format. XML is a markup language, like HTML, using tags for different components of the data structure. SIMMAP can read both DNA and discrete, numerically coded morphological traits. The former is called datatype="nucleotide" by SIMMAP and the latter datatype="standard" by SIMMAP.

I have now posted a function online, export.as.xml, that can create an XML datafile using data stored in various formats and a tree or multiple trees stored as a "phylo" or "multiPhylo" object (respectively).

A direct link to the code is here. I will also most likely include it in future releases of phytools. (I may also include the capacity to export for SIMMAP v1.0. SIMMAP v1.0 does not use XML, but a modified NEXUS format, and is still in fairly wide use.)

Friday, June 8, 2012

Blomberg's K (phylogenetic signal) with intraspecific variability, more than one sample per species, error in the estimation of species means, etc.

I feel certain that I have addressed this in several past posts, but due to some recent R-sig-phylo and phytools user inquiries, I thought I would just re-post a general procedure for computing phylogenetic signal with intraspecific variability. This includes estimation of signal with multiple observations per species, and computing phylogenetic signal when species means are uncertain. I start by assuming that the data are a vector x in which species names are in names(x) (and thus observations from the same species have the same names). I assume also that the tree (a "phylo" object) is contained in tree.

A little aside might be worthwhile to mention here. When we read in our data from file, we often have our species names in the first column of the input file. We would then do (say) X<-read.table(...,row.names=1) so that R will use the first column of our input data file as the row names of the data table, X. The problem with this in the case of multiple observations per species is that the data type "data.frame" does not allow row names to repeat. In this case, we should just read the row names in in the first column of X, and then do: x<-X[,2]; names(x)<-X[,1], which will give us what we want.

To incorporate estimation error (i.e., multiple observations per species) we should use the method of Ives et al. (2007). To do this, we need sampling errors for each species. This presents a problem when we have a data set that consists of multiple observations for some species - but only one per species for others. In this situation, I recommend assuming that the variances we cannot calculate (i.e., the ones for species with only one observation) are equal to the mean within-species variance. We could also use a pooled within species variance, putting more weight on the variances calculated for species with large sample sizes. Either assumption will be safest if the intraspecific variance is not too heterogeneous among species, and if we have log-transformed our data before analysis. Neither procedure is all that difficult (the pooled variance calculation is just a touch trickier), so I will demonstrate both.

> require(phytools)
> # first let's simulate a tree & data for the purposes
> # of demonstration (normally read from file)
> N<-300 # number of species for simulation
> tree<-pbtree(n=N)
> # simulate the true species means
> xhat<-fastBM(tree)
> phylosig(tree,xhat)
[1] 0.9623659
> # simulate 1-20 observations per species
> # with the same intraspecific variance
> x<-sampleFrom(xhat,rep(1,N),randn=c(1,20))
> # get the means by species
> temp<-aggregate(x,by=list(names(x)),mean)
> xbar<-temp[,2]; names(xbar)<-temp[,1]
> # get the variance by species
> temp<-aggregate(x,by=list(names(x)),var)
> xvar<-temp[,2]; names(xvar)<-temp[,1]
> # get the sample size per species
> n<-as.vector(table(names(x)))
> # replace NA with mean (m) or pooled (p) variance
> xvarm<-xvarp<-xvar
> xvarm[is.na(xvar)]<-mean(xvar,na.rm=TRUE)
> xvarp[is.na(xvar)]<-0
> xvarp[is.na(xvar)]<-
   sum((n-1)*xvarp/(sum(n[n>1])-length(n[n>1])))
> # compute K ignoring sampling error
> phylosig(tree,xbar)
[1] 0.6166613
> # compute K with sampling error, mean variance for n=1
> phylosig(tree,xbar,se=sqrt(xvarm/n))
$K
[1] 0.9476694
$sig2
[1] 1.039297
$logL
[1] -444.4693
> # compute K with sampling error, pooled variance for n=1
> phylosig(tree,xbar,se=sqrt(xvarp/n))
$K
[1] 0.947966
$sig2
[1] 1.039077
$logL
[1] -444.4875

One characteristic feature that emerges from this is that phylogenetic signal is underestimated (here fairly dramatically, although this will depend on the amount of sampling error we've ignored) when uncertainty in the estimation of species means is ignored. This is a general phenomenon. We do much better when we include it in our estimating procedure.