Tuesday, April 25, 2017

ratebytree method for comparing evolutionary rates between trees taking into account sampling error in the estimation of species means

I just updated the phytools function ratebytree to permit the user to supply vectors of standard errors for each corresponding vector of trait values for the tips.

ratebytree, recall (1, 2 3), is a function that can be used to test whether the rate of evolution for a continuous character differs in different trees. This method is identical to Brian O'Meara et al.'s (2006) censored rate test, and should be closely related to an approach for comparing the rate of evolution for different traits that was proposed by Dean Adams (2013).

The function minimally takes as input an object of class "multiPhylo" containing our two or more trees, and a list of vectors, x, with trait values for each species in each corresponding tree.

If we have uncertainty in our species trait value estimates in x that will tend to have the fairly predictable effect of inflating our estimates of the evolutionary rate for our character, σ2. Furthermore, if this uncertainty differs among trees - then trees for which the trait has more uncertainty will tend to have disproproportionately overestimated evolutionary rates, causing us to be more likely to reject our null hypothesis of constant evolutionary rate across trees. I will demonstrate this below.

First, our updated code:

ratebytree<-function(trees,x,...){
    if(hasArg(tol)) tol<-list(...)$tol
    else tol<-1e-8
    if(hasArg(trace)) trace<-list(...)$trace
    else trace<-FALSE
    if(hasArg(digits)) digits<-list(...)$digits
    else digits<-6
    if(hasArg(test)) test<-list(...)$test
    else test<-"chisq"
    if(hasArg(quiet)) quiet<-list(...)$quiet
    else quiet<-FALSE
    ## check trees & x
    if(!inherits(trees,"multiPhylo")) 
        stop("trees should be object of class \"multiPhylo\".")
    if(!is.list(x)) stop("x should be a list of vectors.")
    if(hasArg(se)) se<-list(...)$se
    else {
        se<-x
        for(i in 1:length(x)) se[[i]][1:length(se[[i]])]<-0
    }
    N<-length(trees)
    ## reorder the trait vectors in x & SEs in se
    x<-mapply(function(x,t) x<-x[t$tip.label],x=x,t=trees,SIMPLIFY=FALSE)
    se<-mapply(function(x,t) x<-x[t$tip.label],x=se,t=trees,SIMPLIFY=FALSE)
    ## first, fit multi-rate model
    lik.multi<-function(theta,trees,x,se,trace=FALSE){
        n<-sapply(trees,Ntip)
        N<-length(trees)
        sig<-theta[1:N]
        a<-theta[(N+1):(2*N)]
        C<-lapply(trees,vcv)
        E<-lapply(se,diag)
        V<-mapply("+",mapply("*",C,sig,SIMPLIFY=FALSE),E,SIMPLIFY=FALSE)
        logL<-0
        for(i in 1:N) 
            logL<-logL-t(x[[i]]-a[i])%*%solve(V[[i]])%*%(x[[i]]-
                a[i])/2-n[i]*log(2*pi)/2-determinant(V[[i]])$modulus[1]/2
        if(trace){
            cat(paste(paste(round(sig,digits),collapse="\t"),round(logL,digits),"\n",sep="\t"))
            flush.console()
        }
        -logL
    }
    foo<-function(tree,x){ 
        pvcv<-phyl.vcv(as.matrix(x),vcv(tree),1)
        c(pvcv$R[1,1],pvcv$a[1,1])
    }
    PP<-mapply(foo,trees,x)
    if(hasArg(init)){ 
        init<-list(...)$init
        if(!is.null(init$sigm)) PP[1,]<-init$sigm
        if(!is.null(init$am)) PP[2,]<-init$am
    }
    p<-as.vector(t(PP))
    if(trace){
        cat("\nOptimizing multi-rate model....\n")
        cat(paste(paste("sig[",1:N,"]   ",sep="",collapse="\t"),"logL\n",sep="\t"))
    }
    fit.multi<-optim(p,lik.multi,trees=trees,x=x,se=se,trace=trace,method="L-BFGS-B",
        lower=c(rep(tol,N),rep(-Inf,N)),upper=c(rep(Inf,N),rep(Inf,N)))
    ## now fit single-rate model
    lik.onerate<-function(theta,trees,x,se,trace=FALSE){
        n<-sapply(trees,Ntip)
        N<-length(trees)
        sig<-theta[1]
        a<-theta[1:N+1]
        C<-lapply(trees,vcv)
        E<-lapply(se,diag)
        V<-mapply("+",lapply(C,"*",sig),E,SIMPLIFY=FALSE)
        logL<-0
        for(i in 1:N) 
            logL<-logL-t(x[[i]]-a[i])%*%solve(V[[i]])%*%(x[[i]]-
                a[i])/2-n[i]*log(2*pi)/2-determinant(V[[i]])$modulus[1]/2
        if(trace){ 
            cat(paste(round(sig,digits),round(logL,digits),"\n",sep="\t"))
            flush.console()
        }
        -logL
    }
    p<-c(mean(fit.multi$par[1:N]),fit.multi$par[(N+1):(2*N)])
    if(hasArg(init)){
        if(!is.null(init$sigc)) p[1]<-init$sigc
        if(!is.null(init$ac)) p[1:N+1]<-init$ac
    }
    if(trace){
        cat("\nOptimizing common-rate model....\n")
        cat(paste("sig      ","logL\n",sep="\t"))
    }
    fit.onerate<-optim(p,lik.onerate,trees=trees,x=x,se=se,trace=trace,method="L-BFGS-B",
        lower=c(tol,rep(-Inf,N)),upper=c(Inf,rep(Inf,N)))
    ## compare models:
    LR<-2*(-fit.multi$value+fit.onerate$value)
    if(test=="chisq") P.chisq<-pchisq(LR,df=N-1,lower.tail=FALSE)
    else if(test=="simulation"){
        if(!quiet) cat("Generating null distribution via simulation -> |")
        flush.console()
        if(hasArg(nsim)) nsim<-list(...)$nsim
        else nsim<-100
        X<-mapply(fastBM,tree=trees,a=as.list(fit.onerate$par[1:N+1]),
            MoreArgs=list(sig2=fit.onerate$par[1],nsim=nsim),
            SIMPLIFY=FALSE)
        P.sim<-1/(nsim+1)
        pct<-0.1
        for(i in 1:nsim){
            x.sim<-lapply(X,function(x,ind) x[,ind],ind=i)
            foo<-function(x,se) sampleFrom(xbar=x,xvar=se^2,n=rep(1,length(x)))
            x.sim<-mapply(foo,x=x.sim,se=se,SIMPLIFY=FALSE)
            fit.sim<-ratebytree(trees,x.sim,se=se)
            P.sim<-P.sim+(fit.sim$likelihood.ratio>=LR)/(nsim+1)
            if(i/nsim>=pct){
                if(!quiet) cat(".")
                flush.console()
                pct<-pct+0.1
            }
        }
        if(!quiet) cat(".|\nDone!\n")
        flush.console()
    }
    obj<-list(
        multi.rate.model=list(sig2=fit.multi$par[1:N],
            a=fit.multi$par[(N+1):(2*N)],
            k=2*N,
            logL=-fit.multi$value,
            counts=fit.multi$counts,convergence=fit.multi$convergence,
            message=fit.multi$message),
        common.rate.model=list(sig2=fit.onerate$par[1],
            a=fit.onerate$par[1:N+1],
            k=N+1,
            logL=-fit.onerate$value,
            counts=fit.onerate$counts,convergence=fit.onerate$convergence,
            message=fit.onerate$message),
        N=N,likelihood.ratio=LR,
        P.chisq=if(test=="chisq") P.chisq else NULL,
        P.sim=if(test=="simulation") P.sim else NULL)
    class(obj)<-"ratebytree"
    obj
}

The standard errors appear in our likelihood function in which we expect that the distribution of trait values among species for each tree will follow separate multivariate normal distributions. The covariances between traits are determined by the shared history between each pair of species in each tree, multiplied by its individual or common evolutionary rate; wherease the sampling error in the estimation of each species value adds additional variance to each species - or the diagonal of the among-species covariance matrix. For instance, in the multi-rate model we can code that as follows:

C<-lapply(trees,vcv)
E<-lapply(se,diag)
V<-mapply("+",mapply("*",C,sig,SIMPLIFY=FALSE),E,SIMPLIFY=FALSE)

Next, here are some data simulated with known sampling error:

library(phytools)
print(trees,details=TRUE)
## 3 phylogenetic trees
## tree 1 : 20 tips
## tree 2 : 30 tips
## tree 3 : 40 tips
x
## [[1]]
##         t4        t11        t12        t17        t18         t5 
##  2.6481447  0.4661793  1.0831414  1.2131310  0.3860602  2.2495161 
##         t6         t7        t13        t14        t15        t16 
##  2.3507374  1.4713536  1.1253945  1.6071969  0.1479299 -0.8062576 
##        t10        t19        t20         t8         t9         t3 
##  2.9102007  2.5560991  1.9184363 -1.4721703  0.5777698  0.1301554 
##         t2         t1 
## -0.1057433 -4.2611545 
## 
## [[2]]
##          t24          t25           t9          t13          t29 
##  1.988536465  1.915313924  0.299550155  4.292766521  0.653230195 
##          t30          t28           t6           t5          t12 
##  1.203715108  0.102751750 -0.645541964  1.968512086  1.727858475 
##          t26          t27          t19          t22          t23 
## -0.699650512 -0.133673530  0.953614294  2.512608107  1.821875227 
##          t14          t15          t16           t8           t4 
## -3.685948987  3.233012941  2.100277307 -0.437167966  2.101576587 
##           t1           t2           t3          t10          t11 
##  0.003862844 -2.059539697  1.702781561  1.938737185  0.317377403 
##           t7          t20          t21          t17          t18 
##  3.434024402  2.742622012  2.170338120  2.536667937  2.051949323 
## 
## [[3]]
##         t13         t14          t4         t11         t12          t7 
## -3.73565881 -5.35580344 -1.92232141 -1.80664083 -3.56235896  1.19405969 
##         t18         t19         t39         t40          t8         t25 
## -2.52512058 -0.90457046  3.73506309  3.76850138 -3.04353994 -1.38045826 
##         t26         t22         t21         t35         t36         t33 
## -1.84399585 -0.29081302  1.05979926 -0.57876241 -0.70920794 -0.30951057 
##         t34         t31         t32         t37         t38         t23 
## -0.86259743 -0.05510418 -0.96796909 -0.85093761 -1.01844679 -2.33523568 
##         t24         t27         t28         t17          t6          t5 
## -1.77279212 -0.14813944 -1.30408459  1.24565151  3.55008328  1.30945303 
##          t3          t9         t10          t1         t15         t16 
##  1.42734589 -2.45389062 -4.79738419  0.46069669 -5.50885861  0.95661827 
##         t29         t30         t20          t2 
## -4.78770716 -4.09772389 -1.29677247  0.18260431
se
## [[1]]
##        t4       t11       t12       t17       t18        t5        t6 
## 0.1832955 0.2055037 0.9259859 0.2544893 0.6866276 0.1065910 0.8189919 
##        t7       t13       t14       t15       t16       t10       t19 
## 0.8565421 1.3646163 0.6294259 0.2882527 0.7556372 0.3189413 0.4353811 
##       t20        t8        t9        t3        t2        t1 
## 0.6819633 0.9088295 0.7724640 1.0819755 0.4402289 0.7797574 
## 
## [[2]]
##       t24       t25        t9       t13       t29       t30       t28 
## 0.7009487 0.5644663 0.1654113 0.8073011 1.0681880 0.7202309 0.9569773 
##        t6        t5       t12       t26       t27       t19       t22 
## 1.1685051 0.5402671 0.5715983 0.2655963 0.1554442 0.2401619 0.8473433 
##       t23       t14       t15       t16        t8        t4        t1 
## 0.1048537 0.8839268 0.5172862 0.3591575 0.4931149 0.3827068 0.8414095 
##        t2        t3       t10       t11        t7       t20       t21 
## 0.7907942 0.4937752 1.7346567 0.7767892 1.0629614 0.9776910 0.6644410 
##       t17       t18 
## 0.6038626 0.5533107 
## 
## [[3]]
##        t13        t14         t4        t11        t12         t7 
## 0.83521027 1.20944000 0.59861186 0.93622996 0.45013743 0.32601858 
##        t18        t19        t39        t40         t8        t25 
## 0.40115329 0.04991500 0.89637040 0.18988455 0.79793769 1.28376830 
##        t26        t22        t21        t35        t36        t33 
## 0.72668364 0.93176376 0.51124161 0.73754307 0.40678985 0.78433763 
##        t34        t31        t32        t37        t38        t23 
## 0.50845162 0.69661077 1.09500529 0.70244537 0.40624945 0.86717621 
##        t24        t27        t28        t17         t6         t5 
## 0.36704605 0.91854203 0.97549965 0.67910275 0.26884018 0.39211593 
##         t3         t9        t10         t1        t15        t16 
## 0.61850904 0.09206774 0.53550511 0.75619233 0.95858204 1.08994091 
##        t29        t30        t20         t2 
## 0.61255500 0.72990224 0.09216722 0.49320373

Let's fit our models - first ignoring sampling error, then taking it into account:

ratebytree(trees,x)
## ML common-rate model:
##  s^2  a[1]   a[2]    a[3]    k   logL
## value    3.4586  -0.1385 1.2875  -1.4435 4   -175.5111   
## 
## ML multi-rate model:
##   s^2[1] s^2[2]  s^2[3]   a[1]   a[2]    a[3]    k   logL
## value    2.2176  3.9121  3.7387  -0.1385 1.2875  -1.4435 6   -174.4738   
## 
## Likelihood ratio: 2.0747 
## P-value (based on X^2): 0.3544 
## 
## R thinks it has found the ML solution.
ratebytree(trees,x,se=se)
## ML common-rate model:
##  s^2  a[1]   a[2]    a[3]    k   logL
## value    2.2909  -0.0646 1.2665  -1.4029 4   -178.1981   
## 
## ML multi-rate model:
##   s^2[1] s^2[2]  s^2[3]   a[1]   a[2]    a[3]    k   logL
## value    0.9373  1.986   3.211   0.0235  1.2655  -1.4134 6   -175.6829   
## 
## Likelihood ratio: 5.0304 
## P-value (based on X^2): 0.0808 
## 
## R thinks it has found the ML solution.

These data were simulated with a difference of rate & with σ12=1, σ22=2, and σ32=3, so we can see that our estimates are much more accurate when we account for error in the estimation of species means.

(By the way, I simulated the data & trees as follows.)

t1<-pbtree(n=20)
t2<-pbtree(n=30)
t3<-pbtree(n=40)
se.x<-sqrt(setNames(rexp(n=Ntip(t1),rate=2),t1$tip.label))
se.y<-sqrt(setNames(rexp(n=Ntip(t2),rate=2),t2$tip.label))
se.z<-sqrt(setNames(rexp(n=Ntip(t3),rate=2),t3$tip.label))
x<-sampleFrom(fastBM(t1,sig2=1),se.x^2,n=rep(1,Ntip(t1)))
y<-sampleFrom(fastBM(t2,sig2=2),se.y^2,n=rep(1,Ntip(t2)))
z<-sampleFrom(fastBM(t3,sig2=3),se.z^2,n=rep(1,Ntip(t3)))
trees<-c(t1,t2,t3)
x<-list(x,y,z)
se<-list(se.x,se.y,se.z)

Next, we can conduct simulations under the null hypothesis to examine the the type I error rate that results when we have different mean sampling variances for our trait values in our different clades, no real difference in rate, but we fail to take this sampling error into consideration.

For this simulation, I will draw sampling variances from an exponential distribution, as before, but with rate (λ) set to a fixed value of 3 for tree 1 of each simulation, but varied between 1 & 10 for tree 2.

N1<-N2<-26
nsim<-200
t1<-pbtree(n=N1,scale=1,nsim=nsim)
t2<-pbtree(n=N2,scale=1,nsim=nsim)
r1<-rep(3,10)
r2<-1:10
P1<-P2<-matrix(0,nsim,length(r1),dimnames=list(1:nsim,round(r2/r1,3)))
for(i in 1:length(r1)){
    for(j in 1:length(t1)){
        se.x<-setNames(sqrt(rexp(n=N1,rate=r1[i])),t1[[j]]$tip.label)
        se.y<-setNames(sqrt(rexp(n=N2,rate=r2[i])),t2[[j]]$tip.label)
        x<-sampleFrom(fastBM(t1[[j]],sig2=1),se.x^2,n=rep(1,N1))
        y<-sampleFrom(fastBM(t2[[j]],sig2=1),se.y^2,n=rep(1,N2))
        fit1<-ratebytree(trees=c(t1[[j]],t2[[j]]),x=list(x,y))
        fit2<-ratebytree(trees=c(t1[[j]],t2[[j]]),x=list(x,y),se=list(se.x,se.y))
        P1[j,i]<-fit1$P.chisq
        P2[j,i]<-fit2$P.chisq
    }
    cat(paste("Done r2/r1 =",round(r2[i]/r1[i],3),"\n"))
}
## Done r2/r1 = 0.333 
## Done r2/r1 = 0.667 
## Done r2/r1 = 1 
## Done r2/r1 = 1.333 
## Done r2/r1 = 1.667 
## Done r2/r1 = 2 
## Done r2/r1 = 2.333 
## Done r2/r1 = 2.667 
## Done r2/r1 = 3 
## Done r2/r1 = 3.333
typeI.1<-colMeans(P1<=0.05)
typeI.2<-colMeans(P2<=0.05)
plot(r2/r1,typeI.1,type="b",ylim=c(0,1),pch=21,bg="grey",lwd=2,
    ylab="type I error rate",xlab=expression(paste(lambda[2]/lambda[1],
    " for exponentially distributed sampling errors")))
lines(r2/r1,typeI.2,type="b",lty="dotted",pch=21,bg="grey",lwd=2)
abline(h=0.05,lty="dashed",col="red",lwd=2)
legend(x=2.6,y=1,legend=c("no SE","with SE","0.05"),
    lty=c("solid","dotted","dashed"),lwd=2,col=c("black","black","red"))

plot of chunk unnamed-chunk-6

Hopefully we see that the type I error is elevated above it's nominal for all circumstances in which data have been simulated with sampling error - but this has not been taken into account; and, furthermore, that this bias is greatest when sampling error differs systematically between trees (that is, when λ21≠1.0). Note that the type I error rate may also be elevated above the nominal level of 0.05 even with sampling error accounted for. This is most likely due to the fact that the theory of likelihoods tells us that the likelihood-ratio test statistic is only asymptotically χ2 distributed for large N, and we are dealing with relatively small trees in this simulation. Hopefully this rate would decrease to approximately 0.05 for larger N or if we obtained the null distribution of the test statistic via simulation (an option that is already implemented in ratebytree, although it requires substantially more computation).

That's it!

No comments:

Post a Comment