Wednesday, September 14, 2022

On the expected distribution of the residual error from phylogenetic regression & ANOVA

A phytools user recently asked the following interesting question:

“I contact you for a statistical question regarding GLS and PGLS, which I am not super familiar with. I have read both that GLS do and do not require the distribution of residuals to be normal, and I cannot make a decision on this. A reviewer even ask us to test for normality of the variables, not even the residuals. What is your opinion on this?”

As I have both posted on this blog and published about in the past:

“…we do not expect normality of the dependent (or independent, for continuous x) variables in phylogenetic data. Instead, what we do expect is multivariate normality of the residual error in y given X (or, equivalently, normality of y given X, controlling for the tree).”

If we want to test for normality, we should test for the multivariate normality of the residual errors of our fitted model. One way to do that is to back-transform our data using a decomposition (like a Cholesky factorization) of the phylogenetic covariance matrix (e.g., following Butler et al 2000), and then test for normality of these transformed residuals. This is functionality (& philosophically) equivalent to a simpler normalization of uncorrelated random data.

Here's a quick example, using simulation. In this case, I'll use an independent variable, X, that is a factor and then do a phylogenetic generalized ANOVA, but the same analysis would apply with a GLS regression. For my normality test, I will use the function lillie.test of the nortest R package on CRAN.

## load packages
library(phytools)
library(nlme)
library(nortest)
## simulate tree
tree<-NULL
while(is.null(tree))
    tree<-pbtree(b=1,d=0.8,t=20,
        extant.only=TRUE)
## Warning:
##   no extant tips, tree returned as NULL
## simulate phylogenetic residual error
e<-fastBM(tree,sig2=2)
## simulate factor
X<-setNames(as.factor(sample(1:3,Ntip(tree),
    replace=TRUE)),tree$tip.label)
## arbitrary beta
beta<-c(5,-10,30)
## simulate y
y<-beta[X]+e
## put data in data frame
Data<-data.frame(x=X,y=y)
## create correlation structure
spp<-rownames(Data)
bm<-corBrownian(1,tree,form=~spp)
## fit generalized anova using nlme::gls
fit.anova<-gls(y~x,data=Data,correlation=bm)
fit.anova
## Generalized least squares fit by REML
##   Model: y ~ x 
##   Data: Data 
##   Log-restricted-likelihood: -661.9959
## 
## Coefficients:
## (Intercept)          x2          x3 
##    7.775999  -14.964163   24.984617 
## 
## Correlation Structure: corBrownian
##  Formula: ~spp 
##  Parameter estimate(s):
## numeric(0)
## Degrees of freedom: 355 total; 352 residual
## Residual standard error: 5.418697

Now that I've fit my model, I'm going to visualize and then test for normality of three different vectors. First, the original data for my dependent variable, y; second, my untransformed residuals from the fitted model; last, my transformed residuals in which I use the Cholesky factorization of the inverse of my phylogenetic covariance matrix, vcv(tree).

## original response variable, y
d1<-density(y,bw=1)
plot(d1,bty="n",type="n",main="density of y",
    font.main=3)
polygon(d1,col=make.transparent("blue",0.25),
    border="blue")

plot of chunk unnamed-chunk-2

## normality test (always fails)
lillie.test(y)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  y
## D = 0.099288, p-value = 5.66e-09
## untransformed residuals
d2<-density(residuals(fit.anova),bw=1)
plot(d2,bty="n",type="n",main="density of untransformed residuals",
    font.main=3)
polygon(d2,col=make.transparent("blue",0.25),
    border="blue")

plot of chunk unnamed-chunk-2

## normality test (often fails)
lillie.test(residuals(fit.anova))
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  residuals(fit.anova)
## D = 0.040746, p-value = 0.162
## transformed residuals
transformed<-chol(solve(vcv(tree)))%*%residuals(fit.anova)
d3<-density(transformed,bw=1)
plot(d3,bty="n",type="n",main="density of transformed residuals",
    font.main=3)
polygon(d3,col=make.transparent("blue",0.25),
    border="blue")

plot of chunk unnamed-chunk-2

## normality test (usually passes)
lillie.test(transformed)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  transformed
## D = 0.041221, p-value = 0.1506

OK, now let's try a “real” example, in this case from my book with Luke Harmon using data from Kirk & Kay (2004).

## load data
primate.data<-read.csv("http://www.phytools.org/Rbook/3/primateEyes.csv",
    row.names=1,stringsAsFactors=TRUE)
primate.tree<-read.tree("http://www.phytools.org/Rbook/3/primateEyes.phy")
## create correlation structure
spp<-rownames(primate.data)
corBM<-corBrownian(phy=primate.tree,form=~spp)
## fit model
primate.ancova<-gls(log(Orbit_area)~log(Skull_length)+
    Activity_pattern,data=primate.data,
    correlation=corBM)
summary(primate.ancova)
## Generalized least squares fit by REML
##   Model: log(Orbit_area) ~ log(Skull_length) + Activity_pattern 
##   Data: primate.data 
##         AIC       BIC   logLik
##   -86.45385 -74.18212 48.22693
## 
## Correlation Structure: corBrownian
##  Formula: ~spp 
##  Parameter estimate(s):
## numeric(0)
## 
## Coefficients:
##                                Value Std.Error   t-value p-value
## (Intercept)               -0.7310348 0.3761724 -1.943350  0.0552
## log(Skull_length)          1.4717247 0.0758534 19.402228  0.0000
## Activity_patternDiurnal   -0.1060187 0.0777268 -1.363991  0.1761
## Activity_patternNocturnal  0.2935941 0.1213931  2.418540  0.0177
## 
##  Correlation: 
##                           (Intr) lg(S_) Actv_D
## log(Skull_length)         -0.914              
## Activity_patternDiurnal   -0.277  0.100       
## Activity_patternNocturnal -0.518  0.329  0.602
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -1.89188666 -0.77574126 -0.41922692  0.07417665  2.26411750 
## 
## Residual standard error: 0.2933082 
## Degrees of freedom: 90 total; 86 residual
## test for normality of transformed residuals
transformed<-chol(solve(vcv(primate.tree)))%*%residuals(primate.ancova)
d3<-density(transformed,bw=0.05)
plot(d3,bty="n",type="n",main="density of transformed residuals (primate ANCOVA)",
    font.main=3)
polygon(d3,col=make.transparent("blue",0.25),
    border="blue")

plot of chunk unnamed-chunk-3

lillie.test(transformed)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  transformed
## D = 0.094818, p-value = 0.04442

So, in this case we see that our transformed residuals fail normality, although the failure is not too severe so (just as with OLS) we should not be overly concerned, IMO.

That's it!

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.