Following some discussion of the topic on the R-sig-phylo special interest group email listserv, I decided to try to program the PGLS regression method for sampling error in x & y described originally in a paper by Tony Ives, Peter Midford, and Ted Garland in Systematic Biology (Ives et al. 2007). Even though Ives et al. worked out all the hard parts, this was not a completely trivial undertaking, as some of the details were a little difficult to work through. I first tried to program this as a custom correlation structure for use with nlme::gls. That may well be possible in theory, but I was not able to do it (although I did learn a bit about how these correlation structures are written).
In the end, I did this the old-fashioned way, and it seems to work: it converges on standard PGLS when sampling error is zero, and it recovers estimates close to the generating parameter values when sampling error is included in simulation and accounted for during estimation. Unfortunately, I have so far only programmed bivariate regression. It is theoretically straightforward to extend the method to multivariable regression, with the caveat that optimizing the likelihood function requires inversion of a covariance matrix whose dimensions scale with N × (p+1) for p independent variables. The optimization is also multidimensional, because we need to simultaneously maximize the likelihood for the rates of evolution in x and y (σ²x, σ²y), the regression slope (b1), and the states at the root of the tree for x and y (a). We don't need to estimate b0 separately; we can obtain it directly from the root values and the slope.
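To make the structure of the problem concrete, below is a minimal sketch of the joint log-likelihood that needs to be maximized. This is just my illustration of the model, not the internals of pgls.Ives: it assumes x evolves by Brownian motion, that y = b0 + b1·x plus a Brownian residual, and that all data and sampling-variance vectors are named by the tip labels of the tree.

# hypothetical sketch, not the internals of pgls.Ives; assumes xe, ye,
# Vx, Vy, & Cxy are vectors named by the tip labels of the tree
logL.sketch<-function(theta,tree,xe,ye,Vx,Vy,Cxy){
    sig2x<-theta[1]; sig2y<-theta[2]; b1<-theta[3]
    ax<-theta[4]; ay<-theta[5] # root states for x & y; b0 = ay - b1*ax
    C<-vcv(tree) # N x N matrix of shared branch lengths (from ape)
    nm<-rownames(C)
    # joint covariance of c(x,y): Brownian blocks, plus the sampling
    # (co)variances on the diagonals
    V<-rbind(cbind(sig2x*C+diag(Vx[nm]),b1*sig2x*C+diag(Cxy[nm])),
        cbind(b1*sig2x*C+diag(Cxy[nm]),(b1^2*sig2x+sig2y)*C+diag(Vy[nm])))
    z<-c(xe[nm],ye[nm])-rep(c(ax,ay),each=length(nm))
    # multivariate normal log-likelihood
    as.numeric(-0.5*t(z)%*%solve(V)%*%z-0.5*determinant(V)$modulus-
        length(z)/2*log(2*pi))
}

Maximizing this function over (σ²x, σ²y, b1, ax, ay), e.g. with optim(), is the five-parameter optimization described above; and the matrix V that gets inverted is 2N × 2N, which is where the N × (p+1) scaling comes from.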
Code for the function is here. Let's try it:
> # require dependencies and load source
> require(phytools)
> source("pgls.Ives.R")
> # simulate a phylogeny
> tree<-pbtree(n=200,scale=1)
> # simulate under some arbitrary regression model
> x<-fastBM(tree)
> y<-5+3.5*x+fastBM(tree,0.2)
> # create vectors of zeroes for the "no error" case
> Vx0<-Vy0<-Cxy0<-rep(0,Ntip(tree))
> names(Vx0)<-names(Vy0)<-names(Cxy0)<-tree$tip.label
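> # (naming the vectors lets them be matched to the tips of the tree)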
> # fit the no error model with simulated data
> res1<-pgls.Ives(tree,x,y,Vx0,Vy0,Cxy0)
> res1
$beta
[1] 4.874868 3.392148
$sig2x
[1] 0.9823115
$sig2y
[1] 1.008072
$a
[1] -0.2179666 4.1354924
$logL
[1] -224.2079
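These estimates are quite close to the generating values: an intercept of 5.0, a slope of 3.5, and σ²x = 1.0 (the default rate in fastBM).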
> # for reference, we can also fit the no-error model with gls() from nlme
> require(nlme)
> res2<-gls(y~x,data.frame(x,y),correlation=corBrownian(1,tree))
> res2
Generalized least squares fit by REML
Model: y ~ x
Data: data.frame(x, y)
Log-restricted-likelihood: -115.1862
Coefficients:
(Intercept) x
4.874825 3.392136
Correlation Structure: corBrownian
Formula: ~1
Parameter estimate(s):
numeric(0)
Degrees of freedom: 200 total; 198 residual
Residual standard error: 1.00909
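Reassuringly, the coefficients from gls() are nearly identical to $beta from pgls.Ives. (The log-likelihoods are not comparable, though: pgls.Ives returns the joint likelihood of x and y, whereas gls() here reports the restricted likelihood of y given x.)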
> # simulate sampling variances and covariances
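> # (bounding each correlation on [-1,1] guarantees that every tip's
> # 2x2 sampling matrix is positive semi-definite)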
> Vx<-rexp(Ntip(tree))/2
> Vy<-rexp(Ntip(tree))/2
> Cxy<-sqrt(Vx*Vy)*runif(Ntip(tree),min=-1,max=1)
> names(Vx)<-names(Vy)<-names(Cxy)<-names(x)
> # simulate data with sampling error (mvrnorm is from MASS)
> require(MASS)
> xe<-ye<-vector()
> for(i in 1:Ntip(tree)){
temp<-mvrnorm(mu=c(x[i],y[i]),
Sigma=matrix(c(Vx[i],Cxy[i],Cxy[i],Vy[i]),2,2))
xe[i]<-temp[1]; ye[i]<-temp[2]
}
> names(xe)<-names(x)
> names(ye)<-names(y)
> # fit no error model to data with error
> res3<-pgls.Ives(tree,xe,ye,Vx0,Vy0,Cxy0)
> res3
$beta
[1] 4.1621231 0.2000734
$sig2x
[1] 27.1866
$sig2y
[1] 24.35597
$a
[1] -0.2924528 4.1036111
$logL
[1] -874.7336
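This is the classic signature of ignoring sampling error: the slope is strongly attenuated toward zero, and the estimated evolutionary rates are wildly inflated by the error variance.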
> # fit error model
> res4<-pgls.Ives(tree,xe,ye,Vx,Vy,Cxy)
> res4
$beta
[1] 5.133185 3.485432
$sig2x
[1] 0.9533415
$sig2y
[1] 0.830365
$a
[1] -0.2760699 4.1709624
$logL
[1] -568.1378
Obviously, in this case our estimates (particularly of the regression slope) are quite bad when sampling error is ignored; but we do much better with the method of Ives et al. Cool!