## Monday, April 23, 2012

### Error in the estimation of species means and model selection in comparative biology

This will also be part of the focus of my upcoming symposium talk at Evolution this summer, but it is becoming my impression that there is a distinct lack of appreciation for how error in the estimation of species means can affect model selection in comparative biology. Error in the estimation of species means is undoubtedly pervasive. For example, authors commonly measure 1-10 specimens per species, meaning that (if intraspecific variability is modest or high), sampling error can be substantial.

Just to give two quick examples, let's consider the consequences of (ignored) error in the estimation of species means for a single-optimum OU model (Ornstein-Uhlenbeck), and an EB (Early-burst) model.

First, let's simulate tree & data under Brownian evolution:

> require(phytools); require(geiger)
> set.seed(10) # for repeatability, but try multiple seeds
> tree<-pbtree(n=200,scale=1)
> x<-fastBM(tree)

Now, let's add some sampling error using the function sampleFrom in phytools:

> xe<-sampleFrom(x,0.2,rep(1,200))

We can easily see that x and xe are quite closely correlated - about as correlated as we might expect our true & estimated species means to be in any empirical study:

> plot(x,xe)

Next, let's try to fit BM & OU models to our data:

> fitBM<-fitContinuous(tree,xe,model="BM")
Fitting BM model:
> fitBM
\$Trait1
\$Trait1\$lnl
[1] -348.0691
\$Trait1\$beta
[1] 9.366567
\$Trait1\$aic
[1] 700.1382
...
> fitOU<-fitContinuous(tree,xe,model="OU")
Fitting OU model:
> fitOU
\$Trait1
\$Trait1\$lnl
[1] -279.5255
\$Trait1\$beta
[1] 126.3536
\$Trait1\$alpha
[1] 63.67358
...
\$Trait1\$aic
[1] 565.051
...

(It can easily be verified that this is not true of the original data vector x, for which BM is the better model:
> fitContinuous(tree,x,model="BM")\$Trait1\$aic < fitContinuous(tree,x,model="OU")\$Trait1\$aic
Fitting BM model:
Fitting OU model:
[1] TRUE
).

We can see already that for a good amount of sampling error in the estimation of species means, our fitted model flips - from BM (the generating model) to OU. We can also test this for EB. Here, I will adjust the bounds on alpha, to allow either a decrease or an increase in the evolutionary rate through time.

> fitEB<-fitContinuous(tree,xe,model="EB", bounds=list(alpha=c(-100,100)))
Fitting EB model:
> fitEB
\$Trait1
\$Trait1\$lnl
[1] -282.1674
\$Trait1\$beta
[1] 1e-08
\$Trait1\$a
[1] 21.63922
\$Trait1\$aic
[1] 570.3348
...

Here again we see that EB fits much better than the generating model, BM. Of course, it is possible to address this bias by following the method of Ives et al. (2007), and this is even implemented in fitContinuous. Let's try again for the EB example: > fitBMe<-fitContinuous(tree,xe,model="BM", meserr=rep(sqrt(0.2),200))
Fitting BM model:
> fitBMe
\$Trait1
\$Trait1\$lnl
[1] -218.4676
\$Trait1\$beta
[1] 0.908003
\$Trait1\$aic
[1] 440.9353
\$Trait1\$aicc
[1] 440.9965
\$Trait1\$k
[1] 2
> fitEBe<-fitContinuous(tree,xe,model="EB", bounds=list(alpha=c(-100,100)),meserr=rep(0.2,200))
Fitting EB model:
> fitEBe
\$Trait1
\$Trait1\$lnl
[1] -218.4567
\$Trait1\$beta
[1] 0.8149614
\$Trait1\$a
[1] 0.1468192
\$Trait1\$aic
[1] 442.9134
\$Trait1\$aicc
[1] 443.0365
\$Trait1\$k
[1] 3

Cool. This is what we might expect. Finally, it is interesting to consider what might happen if phenotypic data actually arose under EB, but then sampling error in the estimation of species means is (again) ignored.

> x<-fastBM(linearchangeTree(tree,slope=-0.5))
> xe<-sampleFrom(x,0.2,rep(1,200))
> fitContinuous(tree,x,model="EB")
Fitting EB model:
\$Trait1
\$Trait1\$lnl
[1] -69.14438
\$Trait1\$beta
[1] 0.9732424
\$Trait1\$a
[1] -0.6107059
\$Trait1\$aic
[1] 144.2888
\$Trait1\$aicc
[1] 144.4118
\$Trait1\$k
[1] 3
> fitContinuous(tree,xe,model="EB")
Fitting EB model:
\$Trait1
\$Trait1\$lnl
[1] -270.8313
\$Trait1\$beta
[1] 4.326543 \$Trait1\$a
[1] 0
\$Trait1\$aic
[1] 547.6626
\$Trait1\$aicc
[1] 547.7857
\$Trait1\$k
[1] 3

In this example, a very strong EB pattern vanishes when there is sampling error.

Of course, the reader should keep in mind that all of the above results depend critically on the size of sampling error relative to the variance among species. If this is very small, it can be much more safely ignored. For example:

> xe<-sampleFrom(x,0.001,rep(1,200))
> fitContinuous(tree,xe,model="EB")
Fitting EB model:
\$Trait1
\$Trait1\$lnl
[1] -75.14334
\$Trait1\$beta
[1] 0.8179839
\$Trait1\$a
[1] -0.3384102
\$Trait1\$aic
[1] 156.2867
\$Trait1\$aicc
[1] 156.4098
\$Trait1\$k
[1] 3

As I mentioned above, a more detailed treatment of this and other issues is underway for my SSB symposium talk at Evolution 2012 in Ottawa.

1. So, a thought. Could the particular model mis-choice depend on the distribution of branch lengths too? If you have a lot of short terminal branches, mis-estimation of trait values will look like an increase in trait variance on those recent terminal branches. Then the trait error will push you to choose the 'lower signal' model family, like OU and WN. It seems also possible (although I have to think about it a bit more) that if you have a lot of relatively long branches, mis-estimation will favor EB because this will look more like a slow-down.

Also, have you considered other aspects of the mis-estimation problem? Such as that some of the measured specimens may not be mature (or even worse, growth may be indeterminate), thus bringing an uncertain ontogenetic factor in, or they be unnoticed sexual dimorphism...

2. Hi David.

I don't think unaccounted for sampling error in the estimation of species means will ever favor an EB model, as the extra variance (due to sampling error) at the end of all the branches of the tree will tend to look more like an accelerating rather than decelerating rate.

Tree shape, although it will not cause us to pick one model over another, can certainly affect our power to distinguish among models. For instance, if all internal edges of the tree are very short (and the tree is thus star-like), we will have little power to distinguish OU from BM. Consequently, it certainly seems possible that sampling error and tree shape might interact.

- Liam

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.