Friday, May 8, 2015

PGLS with measurement or sampling error in the dependent variable, y

In the following demo I show one way to do phylogenetic generalized least squares (PGLS) assuming a Brownian motion model for the variance-covariance structure of the residual error, for conditions in which there is measurement error in the estimation of y, the dependent variable in our model, but not our xs. This was motivated by a question from a UMass-Boston graduate student based on the phylogenetic (generalized) ANOVA - so this is a case in which there might be easily quantifiable error in y but not in our factor or factors.

The way I have done this is not by writing a new corStruct - although one would perhaps ideally like to this.

First, this is the likelihood functin that we are going to optimize. It also returns the fitted model & likelihood when opt=FALSE.

lk<-function(sig2,y,X,C,v=NULL,opt=TRUE){
    n<-nrow(C)
    if(is.null(v)) v<-rep(0,n)
    V<-sig2*C+diag(v)
    beta<-solve(t(X)%*%solve(V)%*%X)%*%(t(X)%*%solve(V)%*%y)
    logL<--(1/2)*t(y-X%*%beta)%*%solve(V)%*%(y-X%*%beta)-
        (1/2)*determinant(V,logarithm=TRUE)$modulus[1]-
        (n/2)*log(2*pi)
    if(opt) -logL[1,1] else list(beta=beta[,1],sig2e=sig2,logL=logL[1,1])
}

Now, next, I start with an illustrative demo using multivariable regressin to show that this function will return the same fitted model (when no sampling error within species is assumed) as gls(...,correlation=corBrownian(...)). Here I go:

## load libraries
library(phytools)
library(nlme)
## simulate tree & data
tree<-pbtree(n=100,scale=1)
X<-fastBM(tree,nsim=2)
colnames(X)<-c("x1","x2")
y<-cbind(rep(1,Ntip(tree)),X)%*%c(1,2,3)+fastBM(tree)
## first, fit the model using gls
fit.gls<-gls(y~x1+x2,data=data.frame(y,X),correlation=corBrownian(1,tree),
    method="ML")
fit.gls
## Generalized least squares fit by maximum likelihood
##   Model: y ~ x1 + x2 
##   Data: data.frame(y, X) 
##   Log-likelihood: -47.8675
## 
## Coefficients:
## (Intercept)          x1          x2 
##   0.9532797   1.9886343   3.0334548 
## 
## Correlation Structure: corBrownian
##  Formula: ~1 
##  Parameter estimate(s):
## numeric(0)
## Degrees of freedom: 100 total; 97 residual
## Residual standard error: 0.9198236
## now fit it using our custom function
## the interval for optimize is specified arbitrarily
fit.lk<-optimize(lk,c(0,1000),y=y,X=cbind(rep(1,Ntip(tree)),X),C=vcv(tree))
fitted<-lk(fit.lk$minimum,y=y,X=cbind(rep(1,Ntip(tree)),X),C=vcv(tree),
    opt=FALSE)
fitted
## $beta
##                  x1        x2 
## 0.9532797 1.9886343 3.0334548 
## 
## $sig2e
## [1] 0.8460792
## 
## $logL
## [1] -47.8675

So, here we can see that it works when we do not have sampling error in y. Next, let's simulate known sampling error and try again:

## these will be our within-species sampling variances
v<-setNames(rexp(n=Ntip(tree)),tree$tip.label)
ye<-setNames(sampleFrom(xbar=y,xvar=v,n=rep(1,length(y))),rownames(y))
fit.lk<-optimize(lk,c(0,1000),y=ye,X=cbind(rep(1,Ntip(tree)),X),
    C=vcv(tree),v=v)
fitted<-lk(fit.lk$minimum,y=ye,X=cbind(rep(1,Ntip(tree)),X),C=vcv(tree),
    v=v,opt=FALSE)
fitted
## $beta
##                x1       x2 
## 0.863873 2.039439 2.969356 
## 
## $sig2e
## [1] 0.6355355
## 
## $logL
## [1] -138.4668
## compare to:
gls(ye~x1+x2,data=data.frame(ye,X),correlation=corBrownian(1,tree),method="ML")
## Generalized least squares fit by maximum likelihood
##   Model: ye ~ x1 + x2 
##   Data: data.frame(ye, X) 
##   Log-likelihood: -231.8525
## 
## Coefficients:
## (Intercept)          x1          x2 
##    1.508517    1.947806    4.554398 
## 
## Correlation Structure: corBrownian
##  Formula: ~1 
##  Parameter estimate(s):
## numeric(0)
## Degrees of freedom: 100 total; 97 residual
## Residual standard error: 5.790837

Now, it also turns out that after we have optimized sig2e we can actually coerce gls into giving us the correct fitted model & likelihood. Here, I do this by distorting the edge lengths of our tree to take into account the fitted sig2e and within-species errors in y.

tt<-tree
tt$edge.length<-tt$edge.length*fitted$sig2e
for(i in 1:length(v)){
    tip<-which(tt$tip.label==names(v)[i])
    ii<-which(tt$edge[,2]==tip)
    tt$edge.length[ii]<-tt$edge.length[ii]+v[i]
}
vv<-diag(vcv(tt))
w<-varFixed(~vv)
fit.gls<-gls(ye~x1+x2,data=data.frame(ye,X),correlation=corBrownian(1,tt),method="ML",weights=w)
fit.gls
## Generalized least squares fit by maximum likelihood
##   Model: ye ~ x1 + x2 
##   Data: data.frame(ye, X) 
##   Log-likelihood: -138.4384
## 
## Coefficients:
## (Intercept)          x1          x2 
##    0.863873    2.039439    2.969356 
## 
## Correlation Structure: corBrownian
##  Formula: ~1 
##  Parameter estimate(s):
## numeric(0)
## Variance function:
##  Structure: fixed weights
##  Formula: ~vv 
## Degrees of freedom: 100 total; 97 residual
## Residual standard error: 1.0169
## compare to:
fitted
## $beta
##                x1       x2 
## 0.863873 2.039439 2.969356 
## 
## $sig2e
## [1] 0.6355355
## 
## $logL
## [1] -138.4668

Now, here's a similar demo using the phylogenetic (generalized) ANOVA:

## evolve a factor on the tree
Q<-matrix(c(-2,1,1,1,-2,1,1,1,-2),3,3)
colnames(Q)<-rownames(Q)<-letters[1:3]
x<-as.factor(sim.history(tree,Q)$states)
## Done simulation(s).
y<-fastBM(tree)
y[x=="a"]<-y[x=="a"]+1
y[x=="b"]<-y[x=="b"]+2
y[x=="c"]<-y[x=="c"]+3
fit.gls<-gls(y~x,data=data.frame(y,x),correlation=corBrownian(1,tree),method="ML")
fit.gls
## Generalized least squares fit by maximum likelihood
##   Model: y ~ x 
##   Data: data.frame(y, x) 
##   Log-likelihood: -58.60169
## 
## Coefficients:
## (Intercept)          xb          xc 
##   0.8929341   0.8188519   1.8667981 
## 
## Correlation Structure: corBrownian
##  Formula: ~1 
##  Parameter estimate(s):
## numeric(0)
## Degrees of freedom: 100 total; 97 residual
## Residual standard error: 1.024053
summary(fit.gls)
## Generalized least squares fit by maximum likelihood
##   Model: y ~ x 
##   Data: data.frame(y, x) 
##        AIC      BIC    logLik
##   125.2034 135.6241 -58.60169
## 
## Correlation Structure: corBrownian
##  Formula: ~1 
##  Parameter estimate(s):
## numeric(0)
## 
## Coefficients:
##                 Value Std.Error   t-value p-value
## (Intercept) 0.8929341 0.3801839  2.348690  0.0209
## xb          0.8188519 0.1411306  5.802084  0.0000
## xc          1.8667981 0.1658942 11.252944  0.0000
## 
##  Correlation: 
##    (Intr) xb    
## xb -0.181       
## xc -0.169  0.431
## 
## Standardized residuals:
##         Min          Q1         Med          Q3         Max 
## -3.51162932 -1.05473759 -0.05870776  0.70967087  2.99993479 
## 
## Residual standard error: 1.024053 
## Degrees of freedom: 100 total; 97 residual
X<-model.matrix(~x)
fit.lk<-optimize(lk,c(0,1000),y=y,X=X,C=vcv(tree))
fitted<-lk(fit.lk$minimum,y=y,X=X,C=vcv(tree),opt=FALSE)
fitted
## $beta
## (Intercept)          xb          xc 
##   0.8929341   0.8188519   1.8667981 
## 
## $sig2e
## [1] 1.048674
## 
## $logL
## [1] -58.60169
## now with sampling error in y
ye<-setNames(sampleFrom(xbar=y,xvar=v,n=rep(1,length(y))),rownames(y))
## fit model
fit.lk<-optimize(lk,c(0,1000),y=ye,X=X,C=vcv(tree),v=v)
fitted<-lk(fit.lk$minimum,y=ye,X=X,C=vcv(tree),v=v,opt=FALSE)
fitted
## $beta
## (Intercept)          xb          xc 
##   0.7462831   1.2109657   1.9742749 
## 
## $sig2e
## [1] 1.338498
## 
## $logL
## [1] -136.757
## coerce tree again:
tt<-tree
tt$edge.length<-tt$edge.length*fitted$sig2e
for(i in 1:length(v)){
    tip<-which(tt$tip.label==names(v)[i])
    ii<-which(tt$edge[,2]==tip)
    tt$edge.length[ii]<-tt$edge.length[ii]+v[i]
}
vv<-diag(vcv(tt))
w<-varFixed(~vv)
fit.gls<-gls(ye~x,data=data.frame(ye,x),correlation=corBrownian(1,tt),method="ML",weights=w)
fit.gls
## Generalized least squares fit by maximum likelihood
##   Model: ye ~ x 
##   Data: data.frame(ye, x) 
##   Log-likelihood: -134.6691
## 
## Coefficients:
## (Intercept)          xb          xc 
##   0.7462831   1.2109657   1.9742749 
## 
## Correlation Structure: corBrownian
##  Formula: ~1 
##  Parameter estimate(s):
## numeric(0)
## Variance function:
##  Structure: fixed weights
##  Formula: ~vv 
## Degrees of freedom: 100 total; 97 residual
## Residual standard error: 0.8591587
summary(fit.gls)
## Generalized least squares fit by maximum likelihood
##   Model: ye ~ x 
##   Data: data.frame(ye, x) 
##        AIC     BIC    logLik
##   277.3383 287.759 -134.6691
## 
## Correlation Structure: corBrownian
##  Formula: ~1 
##  Parameter estimate(s):
## numeric(0)
## Variance function:
##  Structure: fixed weights
##  Formula: ~vv 
## 
## Coefficients:
##                 Value Std.Error  t-value p-value
## (Intercept) 0.7462831 0.3900422 1.913339  0.0587
## xb          1.2109657 0.2408905 5.027037  0.0000
## xc          1.9742749 0.3019824 6.537715  0.0000
## 
##  Correlation: 
##    (Intr) xb    
## xb -0.295       
## xc -0.260  0.348
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -3.2076462 -0.8804235 -0.1460402  0.8224632  3.4883858 
## 
## Residual standard error: 0.8591587 
## Degrees of freedom: 100 total; 97 residual

OK, that's it.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.