Tuesday, July 17, 2012

PGLS regression with sampling error

Following some discussion of the topic on the R-sig-phylo special interest group email listserv, I decided to try to program the PGLS regression method for sampling error in X & Y described originally in a paper by Tony Ives, Peter Midford, and Ted Garland in Systematic Biology (Ives et al. 2007). Even though Ives et al. worked out all the hard parts, this was not a completely trivial undertaking, as some of the details were a little difficult to work through. In addition, I first tried to program this as a custom correlation structure to use with nlme::gls. It may well be possible to do this in theory, but I was not able to (although I did learn a bit about how these correlation structures are written).

In the end, I did this the old-fashioned way, and it seems to work (at least, it converges on standard PGLS when sampling error is zero; and it recovers estimates close to the generating parameter values when sampling error is included in simulation and accounted for during estimation). Unfortunately, I have so far only programmed bivariate regression. It is theoretically straightforward to extend to multivariable regression, with the caveat that optimizing the likelihood function requires inversion of a covariance matrix whose dimensions scale with N × (p+1) for p independent variables. The optimization is also multidimensional because we simultaneously need to maximize the likelihood for the rates of evolution in x & y (σx², σy²), the regression slope (b1), and the states at the root for x and y (a). We don't need to separately estimate b0 - we can just get it from the root values and the slope.
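To make that covariance matrix concrete, here is a rough sketch in R of the bivariate case as I read the model in Ives et al. (2007) - this is my own illustration, not the actual pgls.Ives source. If C is the N × N matrix of shared branch lengths, x evolves by Brownian motion with rate σx², y = b0 + b1·x plus a Brownian residual with rate σy², and tip-specific sampling (co)variances Vx, Vy, and Cxy are added on the diagonal, then the stacked observed vector (xe, ye) has a simple 2N × 2N block covariance:

```r
# Joint covariance of the observed (xe, ye) - a sketch, assuming the
# errors-in-variables PGLS model described above.
joint.vcv <- function(C, sig2x, sig2y, b1, Vx, Vy, Cxy) {
  Vxx <- sig2x * C + diag(Vx)                   # Var(xe)
  Vxy <- b1 * sig2x * C + diag(Cxy)             # Cov(xe, ye)
  Vyy <- (b1^2 * sig2x + sig2y) * C + diag(Vy)  # Var(ye)
  rbind(cbind(Vxx, Vxy), cbind(Vxy, Vyy))       # 2N x 2N
}

# multivariate normal log-likelihood of the stacked observations z,
# given the stacked mean vector mu and joint covariance W
logL.joint <- function(z, mu, W) {
  r <- z - mu
  as.numeric(-0.5 * t(r) %*% solve(W, r) -
    0.5 * determinant(W)$modulus - length(z)/2 * log(2 * pi))
}
```

The optimization then just maximizes logL.joint over (σx², σy², b1, ax, ay), with the mean vector mu = c(rep(ax, N), rep(ay, N)); for p predictors W grows to (p+1)N × (p+1)N, which is where the scaling caveat above comes from.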

Code for the function is here. Let's try it:

> # require dependencies and load source
> require(phytools)
> source("pgls.Ives.R")
> # simulate a phylogeny
> tree<-pbtree(n=200,scale=1)
> # simulate under some arbitrary regression model
> x<-fastBM(tree)
> y<-5+3.5*x+fastBM(tree,0.2)
> # create vectors of zeroes for the "no error" case
> Vx0<-Vy0<-Cxy0<-rep(0,length(tree$tip))
> names(Vx0)<-names(Vy0)<-names(Cxy0)<-tree$tip.label
> # fit the no error model with simulated data
> res1<-pgls.Ives(tree,x,y,Vx0,Vy0,Cxy0)
> res1
$beta
[1] 4.874868 3.392148
$sig2x
[1] 0.9823115
$sig2y
[1] 1.008072
$a
[1] -0.2179666 4.1354924
$logL
[1] -224.2079
> # for reference, we can also fit this with gls() from nlme
> require(nlme)
> res2<-gls(y~x,data.frame(x,y),correlation=corBrownian(1,tree))
> res2
Generalized least squares fit by REML
  Model: y ~ x
  Data: data.frame(x, y)
  Log-restricted-likelihood: -115.1862
Coefficients:
(Intercept)           x
   4.874825    3.392136
Correlation Structure: corBrownian
Formula: ~1
Parameter estimate(s):
numeric(0)
Degrees of freedom: 200 total; 198 residual
Residual standard error: 1.00909
> # simulate sampling variances and covariances
> Vx<-rexp(length(tree$tip))/2
> Vy<-rexp(length(tree$tip))/2
> Cxy<-sqrt(Vx*Vy)*runif(length(tree$tip),min=-1,max=1)
> names(Vx)<-names(Vy)<-names(Cxy)<-names(x)
> # simulate data with sampling error (mvrnorm is from MASS)
> require(MASS)
> xe<-ye<-vector()
> for(i in 1:length(tree$tip)){
  temp<-mvrnorm(mu=c(x[i],y[i]),
Sigma=matrix(c(Vx[i],Cxy[i],Cxy[i],Vy[i]),2,2))
  xe[i]<-temp[1]; ye[i]<-temp[2]
  }
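As an aside, the per-tip loop above can also be done without MASS, using the conditional decomposition of each 2 × 2 sampling covariance matrix. This is just an equivalent sketch of the same simulation step, not part of the original code:

```r
# Loop-free way to add correlated sampling error to x & y (base R only).
# Relies on |Cxy| <= sqrt(Vx*Vy), which the simulation above guarantees.
add.sampling.error <- function(x, y, Vx, Vy, Cxy) {
  n <- length(x)
  z1 <- rnorm(n); z2 <- rnorm(n)
  ex <- sqrt(Vx) * z1                           # error in x: Var = Vx
  r <- ifelse(Vx > 0, Cxy / sqrt(Vx), 0)        # gives Cov(ex, ey) = Cxy
  ey <- r * z1 + sqrt(pmax(Vy - r^2, 0)) * z2   # error in y: Var = Vy
  list(xe = x + ex, ye = y + ey)
}
```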
> names(xe)<-names(x)
> names(ye)<-names(y)
> # fit no error model to data with error
> res3<-pgls.Ives(tree,xe,ye,Vx0,Vy0,Cxy0)
> res3
$beta
[1] 4.1621231 0.2000734
$sig2x
[1] 27.1866
$sig2y
[1] 24.35597
$a
[1] -0.2924528  4.1036111
$logL
[1] -874.7336
> # fit error model
> res4<-pgls.Ives(tree,xe,ye,Vx,Vy,Cxy)
> res4
$beta
[1] 5.133185 3.485432
$sig2x
[1] 0.9533415
$sig2y
[1] 0.830365
$a
[1] -0.2760699  4.1709624
$logL
[1] -568.1378


Obviously, in this case our estimate - particularly of the regression slope - is quite bad when sampling error is ignored; but we do much better with the method of Ives et al. Cool!
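The direction of the bias in res3 is no accident: ignoring error in the predictor attenuates the slope estimate toward zero, the classic errors-in-variables result. A quick non-phylogenetic illustration with ordinary least squares (a toy example of my own, not from the analysis above):

```r
set.seed(1)
n <- 5000
x <- rnorm(n)                  # true predictor, variance 1
xe <- x + rnorm(n)             # observed with sampling variance 1
y <- 5 + 3.5 * x + rnorm(n, sd = 0.5)
unname(coef(lm(y ~ x))[2])     # close to the true slope, 3.5
unname(coef(lm(y ~ xe))[2])    # close to 3.5 * 1/(1+1) = 1.75
```

With equal true and error variances in x, the naive slope shrinks by a factor of about 1/2, which is qualitatively what happened to res3 above.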

1 comment:

  1. Hi Liam,

    Any chance you've programmed multivariate regression for pgls.Ives yet? I'm not aware of another package that has a similar function to include sampling error.

    Thanks!
