I just updated the phytools function ratebytree to permit the user to supply vectors of standard errors for each corresponding vector of trait values for the tips.
ratebytree, recall (1, 2, 3), is a function that
can be used to test whether the rate of evolution for a continuous character differs in different trees. This
method is identical to Brian O'Meara et al.'s (2006)
censored rate test, and should be closely related to an approach for comparing the rate of evolution for
different traits that was proposed by Dean Adams (2013).
The function minimally takes as input an object of class "multiPhylo" containing our two or more trees, and a list of vectors, x, with trait values for each species in each corresponding tree.
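Under the hood, ratebytree compares a model in which each tree gets its own evolutionary rate (σ²) to a model with a single common rate, using a likelihood-ratio test. Here is a minimal sketch of that comparison, in which I have simply plugged in the model log-likelihoods from the example fitted further down in this post (so the specific numbers are for illustration only, and just reproduce the likelihood ratio & P-value reported there):
logL.multi<- -174.4738 ## log-likelihood of the multi-rate model (k = 2N parameters)
logL.common<- -175.5111 ## log-likelihood of the common-rate model (k = N+1 parameters)
N<-3 ## number of trees
LR<-2*(logL.multi-logL.common) ## likelihood-ratio test statistic
pchisq(LR,df=N-1,lower.tail=FALSE) ## P-value from a chi-squared distribution with N-1 df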
If we have uncertainty in our species trait value estimates in x, that will tend to have the fairly predictable effect of inflating our estimates of the evolutionary rate for our character, σ². Furthermore, if this uncertainty differs among trees, then trees for which the trait has more uncertainty will tend to have disproportionately overestimated evolutionary rates, making us more likely to reject our null hypothesis of a constant evolutionary rate across trees. I will demonstrate this below.
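(Just to give a quick sense of this effect before the fuller demonstration, here is a minimal sketch, in which the tree size & the amount of sampling error are arbitrary, showing how unmodeled sampling error tends to inflate the rate estimated from a single tree & trait.)
library(phytools)
tree<-pbtree(n=100,scale=1)
x.true<-fastBM(tree,sig2=1) ## true species means, simulated with sig^2 = 1
se<-setNames(rep(0.5,Ntip(tree)),tree$tip.label) ## sampling SE of 0.5 for every tip
x.obs<-sampleFrom(xbar=x.true,xvar=se^2,n=rep(1,Ntip(tree))) ## observed species means
## ML rate estimates that ignore sampling error:
phyl.vcv(as.matrix(x.true),vcv(tree),1)$R[1,1] ## close to the generating value of 1
phyl.vcv(as.matrix(x.obs),vcv(tree),1)$R[1,1] ## typically inflated above 1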
First, our updated code:
ratebytree<-function(trees,x,...){
    if(hasArg(tol)) tol<-list(...)$tol
    else tol<-1e-8
    if(hasArg(trace)) trace<-list(...)$trace
    else trace<-FALSE
    if(hasArg(digits)) digits<-list(...)$digits
    else digits<-6
    if(hasArg(test)) test<-list(...)$test
    else test<-"chisq"
    if(hasArg(quiet)) quiet<-list(...)$quiet
    else quiet<-FALSE
    ## check trees & x
    if(!inherits(trees,"multiPhylo"))
        stop("trees should be object of class \"multiPhylo\".")
    if(!is.list(x)) stop("x should be a list of vectors.")
    if(hasArg(se)) se<-list(...)$se
    else {
        se<-x
        for(i in 1:length(x)) se[[i]][1:length(se[[i]])]<-0
    }
    N<-length(trees)
    ## reorder the trait vectors in x & SEs in se
    x<-mapply(function(x,t) x<-x[t$tip.label],x=x,t=trees,SIMPLIFY=FALSE)
    se<-mapply(function(x,t) x<-x[t$tip.label],x=se,t=trees,SIMPLIFY=FALSE)
    ## first, fit multi-rate model
    lik.multi<-function(theta,trees,x,se,trace=FALSE){
        n<-sapply(trees,Ntip)
        N<-length(trees)
        sig<-theta[1:N]
        a<-theta[(N+1):(2*N)]
        C<-lapply(trees,vcv)
        E<-lapply(se,diag)
        V<-mapply("+",mapply("*",C,sig,SIMPLIFY=FALSE),E,SIMPLIFY=FALSE)
        logL<-0
        for(i in 1:N)
            logL<-logL-t(x[[i]]-a[i])%*%solve(V[[i]])%*%(x[[i]]-
                a[i])/2-n[i]*log(2*pi)/2-determinant(V[[i]])$modulus[1]/2
        if(trace){
            cat(paste(paste(round(sig,digits),collapse="\t"),round(logL,digits),"\n",sep="\t"))
            flush.console()
        }
        -logL
    }
    foo<-function(tree,x){
        pvcv<-phyl.vcv(as.matrix(x),vcv(tree),1)
        c(pvcv$R[1,1],pvcv$a[1,1])
    }
    PP<-mapply(foo,trees,x)
    if(hasArg(init)){
        init<-list(...)$init
        if(!is.null(init$sigm)) PP[1,]<-init$sigm
        if(!is.null(init$am)) PP[2,]<-init$am
    }
    p<-as.vector(t(PP))
    if(trace){
        cat("\nOptimizing multi-rate model....\n")
        cat(paste(paste("sig[",1:N,"] ",sep="",collapse="\t"),"logL\n",sep="\t"))
    }
    fit.multi<-optim(p,lik.multi,trees=trees,x=x,se=se,trace=trace,method="L-BFGS-B",
        lower=c(rep(tol,N),rep(-Inf,N)),upper=c(rep(Inf,N),rep(Inf,N)))
    ## now fit single-rate model
    lik.onerate<-function(theta,trees,x,se,trace=FALSE){
        n<-sapply(trees,Ntip)
        N<-length(trees)
        sig<-theta[1]
        a<-theta[1:N+1]
        C<-lapply(trees,vcv)
        E<-lapply(se,diag)
        V<-mapply("+",lapply(C,"*",sig),E,SIMPLIFY=FALSE)
        logL<-0
        for(i in 1:N)
            logL<-logL-t(x[[i]]-a[i])%*%solve(V[[i]])%*%(x[[i]]-
                a[i])/2-n[i]*log(2*pi)/2-determinant(V[[i]])$modulus[1]/2
        if(trace){
            cat(paste(round(sig,digits),round(logL,digits),"\n",sep="\t"))
            flush.console()
        }
        -logL
    }
    p<-c(mean(fit.multi$par[1:N]),fit.multi$par[(N+1):(2*N)])
    if(hasArg(init)){
        if(!is.null(init$sigc)) p[1]<-init$sigc
        if(!is.null(init$ac)) p[1:N+1]<-init$ac
    }
    if(trace){
        cat("\nOptimizing common-rate model....\n")
        cat(paste("sig ","logL\n",sep="\t"))
    }
    fit.onerate<-optim(p,lik.onerate,trees=trees,x=x,se=se,trace=trace,method="L-BFGS-B",
        lower=c(tol,rep(-Inf,N)),upper=c(Inf,rep(Inf,N)))
    ## compare models:
    LR<-2*(-fit.multi$value+fit.onerate$value)
    if(test=="chisq") P.chisq<-pchisq(LR,df=N-1,lower.tail=FALSE)
    else if(test=="simulation"){
        if(!quiet) cat("Generating null distribution via simulation -> |")
        flush.console()
        if(hasArg(nsim)) nsim<-list(...)$nsim
        else nsim<-100
        X<-mapply(fastBM,tree=trees,a=as.list(fit.onerate$par[1:N+1]),
            MoreArgs=list(sig2=fit.onerate$par[1],nsim=nsim),
            SIMPLIFY=FALSE)
        P.sim<-1/(nsim+1)
        pct<-0.1
        for(i in 1:nsim){
            x.sim<-lapply(X,function(x,ind) x[,ind],ind=i)
            foo<-function(x,se) sampleFrom(xbar=x,xvar=se^2,n=rep(1,length(x)))
            x.sim<-mapply(foo,x=x.sim,se=se,SIMPLIFY=FALSE)
            fit.sim<-ratebytree(trees,x.sim,se=se)
            P.sim<-P.sim+(fit.sim$likelihood.ratio>=LR)/(nsim+1)
            if(i/nsim>=pct){
                if(!quiet) cat(".")
                flush.console()
                pct<-pct+0.1
            }
        }
        if(!quiet) cat(".|\nDone!\n")
        flush.console()
    }
    obj<-list(
        multi.rate.model=list(sig2=fit.multi$par[1:N],
            a=fit.multi$par[(N+1):(2*N)],
            k=2*N,
            logL=-fit.multi$value,
            counts=fit.multi$counts,convergence=fit.multi$convergence,
            message=fit.multi$message),
        common.rate.model=list(sig2=fit.onerate$par[1],
            a=fit.onerate$par[1:N+1],
            k=N+1,
            logL=-fit.onerate$value,
            counts=fit.onerate$counts,convergence=fit.onerate$convergence,
            message=fit.onerate$message),
        N=N,likelihood.ratio=LR,
        P.chisq=if(test=="chisq") P.chisq else NULL,
        P.sim=if(test=="simulation") P.sim else NULL)
    class(obj)<-"ratebytree"
    obj
}
The standard errors enter our likelihood function, in which we assume that the trait values across species in each tree follow a separate multivariate normal distribution. The covariance between each pair of species is determined by their shared history in the tree, multiplied by the tree's individual or common evolutionary rate; whereas the sampling error in the estimation of each species value adds additional variance for that species along the diagonal of the among-species covariance matrix. For instance, in the multi-rate model we can code this as follows:
C<-lapply(trees,vcv)
E<-lapply(se,diag)
V<-mapply("+",mapply("*",C,sig,SIMPLIFY=FALSE),E,SIMPLIFY=FALSE)
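In other words, for each tree the likelihood is just a multivariate normal density with a common mean (the state at the root) and covariance matrix V. Here is a small sketch illustrating that interpretation for one simulated tree; it assumes that the CRAN package mvtnorm is available, and the parameter values sig2 & a are arbitrary:
library(phytools)
library(mvtnorm) ## assumed to be installed; used only for this check
tree<-pbtree(n=20)
C<-vcv(tree)
x<-fastBM(tree,sig2=1)[rownames(C)] ## trait values, reordered to match C
se<-setNames(runif(n=Ntip(tree),min=0,max=1),rownames(C)) ## made-up sampling SEs
sig2<-1.5 ## arbitrary evolutionary rate
a<-0 ## arbitrary root state
V<-sig2*C+diag(se) ## mirrors the construction of V above, for a single tree
dmvnorm(x,mean=rep(a,Ntip(tree)),sigma=V,log=TRUE) ## log-likelihood contribution of this tree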
Next, here are some data simulated with known sampling error:
library(phytools)
print(trees,details=TRUE)
## 3 phylogenetic trees
## tree 1 : 20 tips
## tree 2 : 30 tips
## tree 3 : 40 tips
x
## [[1]]
## t4 t11 t12 t17 t18 t5
## 2.6481447 0.4661793 1.0831414 1.2131310 0.3860602 2.2495161
## t6 t7 t13 t14 t15 t16
## 2.3507374 1.4713536 1.1253945 1.6071969 0.1479299 -0.8062576
## t10 t19 t20 t8 t9 t3
## 2.9102007 2.5560991 1.9184363 -1.4721703 0.5777698 0.1301554
## t2 t1
## -0.1057433 -4.2611545
##
## [[2]]
## t24 t25 t9 t13 t29
## 1.988536465 1.915313924 0.299550155 4.292766521 0.653230195
## t30 t28 t6 t5 t12
## 1.203715108 0.102751750 -0.645541964 1.968512086 1.727858475
## t26 t27 t19 t22 t23
## -0.699650512 -0.133673530 0.953614294 2.512608107 1.821875227
## t14 t15 t16 t8 t4
## -3.685948987 3.233012941 2.100277307 -0.437167966 2.101576587
## t1 t2 t3 t10 t11
## 0.003862844 -2.059539697 1.702781561 1.938737185 0.317377403
## t7 t20 t21 t17 t18
## 3.434024402 2.742622012 2.170338120 2.536667937 2.051949323
##
## [[3]]
## t13 t14 t4 t11 t12 t7
## -3.73565881 -5.35580344 -1.92232141 -1.80664083 -3.56235896 1.19405969
## t18 t19 t39 t40 t8 t25
## -2.52512058 -0.90457046 3.73506309 3.76850138 -3.04353994 -1.38045826
## t26 t22 t21 t35 t36 t33
## -1.84399585 -0.29081302 1.05979926 -0.57876241 -0.70920794 -0.30951057
## t34 t31 t32 t37 t38 t23
## -0.86259743 -0.05510418 -0.96796909 -0.85093761 -1.01844679 -2.33523568
## t24 t27 t28 t17 t6 t5
## -1.77279212 -0.14813944 -1.30408459 1.24565151 3.55008328 1.30945303
## t3 t9 t10 t1 t15 t16
## 1.42734589 -2.45389062 -4.79738419 0.46069669 -5.50885861 0.95661827
## t29 t30 t20 t2
## -4.78770716 -4.09772389 -1.29677247 0.18260431
se
## [[1]]
## t4 t11 t12 t17 t18 t5 t6
## 0.1832955 0.2055037 0.9259859 0.2544893 0.6866276 0.1065910 0.8189919
## t7 t13 t14 t15 t16 t10 t19
## 0.8565421 1.3646163 0.6294259 0.2882527 0.7556372 0.3189413 0.4353811
## t20 t8 t9 t3 t2 t1
## 0.6819633 0.9088295 0.7724640 1.0819755 0.4402289 0.7797574
##
## [[2]]
## t24 t25 t9 t13 t29 t30 t28
## 0.7009487 0.5644663 0.1654113 0.8073011 1.0681880 0.7202309 0.9569773
## t6 t5 t12 t26 t27 t19 t22
## 1.1685051 0.5402671 0.5715983 0.2655963 0.1554442 0.2401619 0.8473433
## t23 t14 t15 t16 t8 t4 t1
## 0.1048537 0.8839268 0.5172862 0.3591575 0.4931149 0.3827068 0.8414095
## t2 t3 t10 t11 t7 t20 t21
## 0.7907942 0.4937752 1.7346567 0.7767892 1.0629614 0.9776910 0.6644410
## t17 t18
## 0.6038626 0.5533107
##
## [[3]]
## t13 t14 t4 t11 t12 t7
## 0.83521027 1.20944000 0.59861186 0.93622996 0.45013743 0.32601858
## t18 t19 t39 t40 t8 t25
## 0.40115329 0.04991500 0.89637040 0.18988455 0.79793769 1.28376830
## t26 t22 t21 t35 t36 t33
## 0.72668364 0.93176376 0.51124161 0.73754307 0.40678985 0.78433763
## t34 t31 t32 t37 t38 t23
## 0.50845162 0.69661077 1.09500529 0.70244537 0.40624945 0.86717621
## t24 t27 t28 t17 t6 t5
## 0.36704605 0.91854203 0.97549965 0.67910275 0.26884018 0.39211593
## t3 t9 t10 t1 t15 t16
## 0.61850904 0.09206774 0.53550511 0.75619233 0.95858204 1.08994091
## t29 t30 t20 t2
## 0.61255500 0.72990224 0.09216722 0.49320373
Let's fit our models - first ignoring sampling error, then taking it into account:
ratebytree(trees,x)
## ML common-rate model:
## s^2 a[1] a[2] a[3] k logL
## value 3.4586 -0.1385 1.2875 -1.4435 4 -175.5111
##
## ML multi-rate model:
## s^2[1] s^2[2] s^2[3] a[1] a[2] a[3] k logL
## value 2.2176 3.9121 3.7387 -0.1385 1.2875 -1.4435 6 -174.4738
##
## Likelihood ratio: 2.0747
## P-value (based on X^2): 0.3544
##
## R thinks it has found the ML solution.
ratebytree(trees,x,se=se)
## ML common-rate model:
## s^2 a[1] a[2] a[3] k logL
## value 2.2909 -0.0646 1.2665 -1.4029 4 -178.1981
##
## ML multi-rate model:
## s^2[1] s^2[2] s^2[3] a[1] a[2] a[3] k logL
## value 0.9373 1.986 3.211 0.0235 1.2655 -1.4134 6 -175.6829
##
## Likelihood ratio: 5.0304
## P-value (based on X^2): 0.0808
##
## R thinks it has found the ML solution.
These data were simulated with a difference in rate among trees (σ₁² = 1, σ₂² = 2, and σ₃² = 3), so we can see that our estimates are much more accurate when we account for error in the estimation of the species means.
(By the way, I simulated the data & trees as follows.)
t1<-pbtree(n=20)
t2<-pbtree(n=30)
t3<-pbtree(n=40)
se.x<-sqrt(setNames(rexp(n=Ntip(t1),rate=2),t1$tip.label))
se.y<-sqrt(setNames(rexp(n=Ntip(t2),rate=2),t2$tip.label))
se.z<-sqrt(setNames(rexp(n=Ntip(t3),rate=2),t3$tip.label))
x<-sampleFrom(fastBM(t1,sig2=1),se.x^2,n=rep(1,Ntip(t1)))
y<-sampleFrom(fastBM(t2,sig2=2),se.y^2,n=rep(1,Ntip(t2)))
z<-sampleFrom(fastBM(t3,sig2=3),se.z^2,n=rep(1,Ntip(t3)))
trees<-c(t1,t2,t3)
x<-list(x,y,z)
se<-list(se.x,se.y,se.z)
Next, we can conduct simulations under the null hypothesis to examine the type I error rate that results when we have different mean sampling variances for our trait values in our different clades and no real difference in rate, but fail to take this sampling error into consideration.
For this simulation, I will draw sampling variances from an exponential distribution, as before, with the rate (λ) fixed at 3 for tree 1 of each simulation and varied between 1 & 10 for tree 2.
N1<-N2<-26
nsim<-200
t1<-pbtree(n=N1,scale=1,nsim=nsim)
t2<-pbtree(n=N2,scale=1,nsim=nsim)
r1<-rep(3,10)
r2<-1:10
P1<-P2<-matrix(0,nsim,length(r1),dimnames=list(1:nsim,round(r2/r1,3)))
for(i in 1:length(r1)){
for(j in 1:length(t1)){
se.x<-setNames(sqrt(rexp(n=N1,rate=r1[i])),t1[[j]]$tip.label)
se.y<-setNames(sqrt(rexp(n=N2,rate=r2[i])),t2[[j]]$tip.label)
x<-sampleFrom(fastBM(t1[[j]],sig2=1),se.x^2,n=rep(1,N1))
y<-sampleFrom(fastBM(t2[[j]],sig2=1),se.y^2,n=rep(1,N2))
fit1<-ratebytree(trees=c(t1[[j]],t2[[j]]),x=list(x,y))
fit2<-ratebytree(trees=c(t1[[j]],t2[[j]]),x=list(x,y),se=list(se.x,se.y))
P1[j,i]<-fit1$P.chisq
P2[j,i]<-fit2$P.chisq
}
cat(paste("Done r2/r1 =",round(r2[i]/r1[i],3),"\n"))
}
## Done r2/r1 = 0.333
## Done r2/r1 = 0.667
## Done r2/r1 = 1
## Done r2/r1 = 1.333
## Done r2/r1 = 1.667
## Done r2/r1 = 2
## Done r2/r1 = 2.333
## Done r2/r1 = 2.667
## Done r2/r1 = 3
## Done r2/r1 = 3.333
typeI.1<-colMeans(P1<=0.05)
typeI.2<-colMeans(P2<=0.05)
plot(r2/r1,typeI.1,type="b",ylim=c(0,1),pch=21,bg="grey",lwd=2,
ylab="type I error rate",xlab=expression(paste(lambda[2]/lambda[1],
" for exponentially distributed sampling errors")))
lines(r2/r1,typeI.2,type="b",lty="dotted",pch=21,bg="grey",lwd=2)
abline(h=0.05,lty="dashed",col="red",lwd=2)
legend(x=2.6,y=1,legend=c("no SE","with SE","0.05"),
lty=c("solid","dotted","dashed"),lwd=2,col=c("black","black","red"))
Hopefully we see that the type I error rate is elevated above its nominal level in all circumstances in which the data have been simulated with sampling error that we have failed to take into account; and, furthermore, that this bias is greatest when the sampling error differs systematically between trees (that is, when λ₂/λ₁ ≠ 1.0).
Note that the type I error rate may also be elevated above the nominal level of 0.05 even when sampling error is accounted for. This is most likely because the likelihood-ratio test statistic is only asymptotically χ² distributed for large N, and we are dealing with relatively small trees in this simulation. Hopefully this rate would decrease to approximately 0.05 for larger N, or if we obtained the null distribution of the test statistic via simulation (an option that is already implemented in ratebytree, although it requires substantially more computation).
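For reference, based on the function code above, the simulation-based version of the test can be run more or less as follows (in which nsim sets the number of null simulations, and 100 is used purely for illustration):
fit.sim<-ratebytree(trees,x,se=se,test="simulation",nsim=100)
fit.sim$P.sim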
That's it!