## Saturday, June 7, 2014

### Performing stepwise phylogenetic regression in R

"Is there a way to do a multiple stepwise regression on continuous data that incorporates phylogenetic information and selects the best model based on an AICc score?"

Although I myself had not done this before, I figured that stepwise regression with objects of class "gls" would likely be something that someone, somewhere, in the wide world of R would have thought of before - and this is indeed correct.

Here's a complete demo with simulation code.

First, simulate data. In this case, x1 and x3 have an effect on y, so they ought to be found in the final model - but x2 and x4 intentionally do not.

require(phytools)
require(nlme)
require(MASS)

## simulate data (using phytools)
tree<-pbtree(n=26,tip.label=LETTERS)
x1<-fastBM(tree)
x2<-fastBM(tree)
x3<-fastBM(tree)
x4<-fastBM(tree)
y<-0.75*x1-0.5*x3+fastBM(tree,sig2=0.2)
X<-data.frame(y,x1,x2,x3,x4)

Now let's fit our model & run the stepwise regression method using AIC:

## build model (using nlme)
obj<-gls(y~x1+x2+x3+x4,data=X,correlation=
corBrownian(1,tree),method="ML")

## run stepwise AIC (using MASS & nlme)
fit<-stepAIC(obj,direction="both",trace=0)

trace just controls the amount of information about the backward & forward stepwise procedure that is being run. If we want more, we should increase it.

Finally, here's a worked simulation with results:

> ## simulate data (using phytools)
> tree<-pbtree(n=26,tip.label=LETTERS)
> x1<-fastBM(tree)
> x2<-fastBM(tree)
> x3<-fastBM(tree)
> x4<-fastBM(tree)
> y<-0.75*x1-0.5*x3+fastBM(tree,sig2=0.2)
> X<-data.frame(y,x1,x2,x3,x4)
> ## build model (using nlme)
> obj<-gls(y~x1+x2+x3+x4,data=X,correlation=
corBrownian(1,tree),method="ML")
> ## run stepwise AIC (using MASS & nlme)
> fit<-stepAIC(obj,direction="both",trace=0)
> fit
Generalized least squares fit by maximum likelihood
Model: y ~ x1 + x3
Data: X
Log-likelihood: -10.43514

Coefficients:
(Intercept)          x1          x3
-0.4952197   0.6882657  -0.5108970

Correlation Structure: corBrownian
Formula: ~1
Parameter estimate(s):
numeric(0)
Degrees of freedom: 26 total; 23 residual
Residual standard error: 0.594681

This is just about what we'd expect. Cool.