Friday, June 8, 2012

Blomberg's K (phylogenetic signal) with intraspecific variability, more than one sample per species, error in the estimation of species means, etc.

I feel certain that I have addressed this in several past posts, but due to some recent R-sig-phylo and phytools user inquiries, I thought I would just re-post a general procedure for computing phylogenetic signal with intraspecific variability. This includes estimation of signal with multiple observations per species, and computing phylogenetic signal when species means are uncertain. I start by assuming that the data are a vector x in which species names are in names(x) (and thus observations from the same species have the same names). I assume also that the tree (a "phylo" object) is contained in tree.

A little aside might be worthwhile to mention here. When we read in our data from file, we often have our species names in the first column of the input file. We would then do (say) X<-read.table(...,row.names=1) so that R will use the first column of our input data file as the row names of the data table, X. The problem with this in the case of multiple observations per species is that the data type "data.frame" does not allow row names to repeat. In this case, we should just read the row names in in the first column of X, and then do: x<-X[,2]; names(x)<-X[,1], which will give us what we want.

To incorporate estimation error (i.e., multiple observations per species) we should use the method of Ives et al. (2007). To do this, we need sampling errors for each species. This presents a problem when we have a data set that consists of multiple observations for some species - but only one per species for others. In this situation, I recommend assuming that the variances we cannot calculate (i.e., the ones for species with only one observation) are equal to the mean within-species variance. We could also use a pooled within species variance, putting more weight on the variances calculated for species with large sample sizes. Either assumption will be safest if the intraspecific variance is not too heterogeneous among species, and if we have log-transformed our data before analysis. Neither procedure is all that difficult (the pooled variance calculation is just a touch trickier), so I will demonstrate both.

> require(phytools)
> # first let's simulate a tree & data for the purposes
> # of demonstration (normally read from file)
> N<-300 # number of species for simulation
> tree<-pbtree(n=N)
> # simulate the true species means
> xhat<-fastBM(tree)
> phylosig(tree,xhat)
[1] 0.9623659
> # simulate 1-20 observations per species
> # with the same intraspecific variance
> x<-sampleFrom(xhat,rep(1,N),randn=c(1,20))
> # get the means by species
> temp<-aggregate(x,by=list(names(x)),mean)
> xbar<-temp[,2]; names(xbar)<-temp[,1]
> # get the variance by species
> temp<-aggregate(x,by=list(names(x)),var)
> xvar<-temp[,2]; names(xvar)<-temp[,1]
> # get the sample size per species
> n<-as.vector(table(names(x)))
> # replace NA with mean (m) or pooled (p) variance
> xvarm<-xvarp<-xvar
> xvarm[is.na(xvar)]<-mean(xvar,na.rm=TRUE)
> xvarp[is.na(xvar)]<-0
> xvarp[is.na(xvar)]<-
   sum((n-1)*xvarp/(sum(n[n>1])-length(n[n>1])))
> # compute K ignoring sampling error
> phylosig(tree,xbar)
[1] 0.6166613
> # compute K with sampling error, mean variance for n=1
> phylosig(tree,xbar,se=sqrt(xvarm/n))
$K
[1] 0.9476694
$sig2
[1] 1.039297
$logL
[1] -444.4693
> # compute K with sampling error, pooled variance for n=1
> phylosig(tree,xbar,se=sqrt(xvarp/n))
$K
[1] 0.947966
$sig2
[1] 1.039077
$logL
[1] -444.4875

One characteristic feature that emerges from this is that phylogenetic signal is underestimated (here fairly dramatically, although this will depend on the amount of sampling error we've ignored) when uncertainty in the estimation of species means is ignored. This is a general phenomenon. We do much better when we include it in our estimating procedure.

No comments:

Post a Comment