Wednesday, November 16, 2011

Including "measurement error" in the estimation of λ

I just posted a new version of the function phylosig() which now allows the user to incorporate measurement error in the estimation of Pagel's λ. Pagel's λ is usually estimated using maximum likelihood, so it was straightforward to condition the estimation of λ & σ2 (the rate of evolution) on an observed set of standard errors. The code for this new version of the function is here. Unfortunately, as I have added capabilities to this function, I have done so in a haphazard fashion, which has resulted in code for this function that has become a little bit more expansive than necessary (compare, for instance, to the code for my new REML version of brownie.lite, here). At some point I will try and clean this up.

To add measurement error to the estimation of λ is not too difficult. Basically, under the λ model we expect that the multivariate distribution of phenotypic values among species will be given by σ2Cλ; where Cλ is an n×n matrix (for n species) containing, in the diagonal, the height of each species above the root; and, in each off-diagonal element Cλ(i,j), the height above the root node of the MRCA of species i and species j multiplied by the coefficient λ. In other words: x ~ MVN(σ2Cλ). Given a set of values for x and a tree, we can estimate λ using likelihood.

To incorporate measurement error (that is, estimation error for the species means for the trait of interest), we just have x ~ MVN(σ2Cλ + E), where E is a matrix containing the error variance and covariances among species. If we assume that estimation error is uncorrelated among species, then E is just a diagonal matrix containing the square of the estimation error for each species. (The assumption that measurement error is independent among species is reasonably safe - but it might be false if, for instance, samples were measured by different investigators or in different years; e.g., see Ives et al. 2007).

The way to run phylosig() with measurement error is simply by supplying the additional vector, se, containing the standard errors for each species. names(se) should contain the species names (corresponding to tree$tip.label, although they need not be in the same order).

To see how the function works, let's try the following example:

> require(phytools); require(geiger)
> set.seed(2)
> gen.lambda<-0.7
> tree<-drop.tip(birthdeath.tree(b=1,d=0,taxa.stop=501),"501")
> se<-sqrt(rchisq(n=500,df=2)); names(se)<-tree$tip.label
> x<-fastBM(lambdaTree(tree,gen.lambda),sig2=1)
> e<-rnorm(n=500,sd=se)
> xe<-x+e
> phylosig(tree,x,method="lambda")
$lambda
[1] 0.7141959
$logL
[1] -1018.6
> phylosig(tree,xe,method="lambda")
$lambda
[1] 0.5431885
$logL
[1] -1153.855
> phylosig(tree,xe,method="lambda",se=se)
$lambda
[1] 0.7274975
$sig2
[1] 1.161162
$logL
[1] -1135.698


I'll note a couple of quick things about this result:

1) The decrease in estimated phylogenetic signal when measurement error is ignored is exactly what we expect for the λ model. This is because measurement error and the λ model have exactly the same effect on the distribution of variation among species (that is, both add extra variance to the tips).

2) The likelihoods for the first result cannot be compared to the likelihoods for the second or third result because it has been obtained for different data (in this case, the hypothetical data for species means without error - the data we never actually have in empirical studies). The likelihoods for results 2 & 3 can be compared because they are obtained for the same data. In this case, the likelihood improved dramatically for the model with measurement error. This suggests that the tree & measurement errors are much better at explaining the observed data than is the tree alone, under the λ model.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.