Yesterday
I posted about an extension of the phytools function ratebytree
to fit different rates
of discrete character evolution to different trees, and then to compare this model to one in which
the rates of evolutionary change between states have been forced to be the same.
I thought it might be neat to do some very simple analyses of the type I error rate of the method -
as well as of its power to detect a difference in rate when one has been simulated. This is something
that we have done (more thoroughtly) in our manuscript describing the statistical properties of
ratebytree
for continuous traits.
To accomplish this, I will first need to simulate under the null & measure the fraction of times amongst my simulations for which the null is rejected (type I errors). Next, I will simulate under various differences in evolutionary rate between trees and measure the fraction of times that the null hypothesis is rejected when false (power).
Here I go. First, with type I error:
library(phytools)
packageVersion("phytools") ## check we have 0.6.25 or above
## [1] '0.6.25'
## simulate pairs of trees
trees<-replicate(200,pbtree(n=100,nsim=2),simplify=FALSE)
head(trees,n=3)
## [[1]]
## 2 phylogenetic trees
##
## [[2]]
## 2 phylogenetic trees
##
## [[3]]
## 2 phylogenetic trees
## now simulate pairs of datasets on each of these trees
Q<-matrix(c(-1,1,1,-1),2,2,dimnames=list(letters[1:2],
letters[1:2]))
foo<-function(t,Q)
list(as.factor(sim.history(t[[1]],Q,message=F)$states),
as.factor(sim.history(t[[2]],Q,message=F)$states))
x<-lapply(trees,foo,Q=Q)
head(x,n=3)
## [[1]]
## [[1]][[1]]
## t6 t52 t99 t100 t87 t88 t67 t68 t20 t95 t96 t81 t82 t21 t22
## a a a a b b b a b a a a a b b
## t15 t1 t2 t47 t48 t75 t76 t69 t70 t91 t92 t73 t74 t46 t12
## b b b b b b b a a a a b a a a
## t16 t17 t44 t97 t98 t63 t32 t24 t27 t77 t78 t3 t18 t19 t64
## b a a b b a a b a a a b a a b
## t83 t84 t8 t10 t56 t57 t42 t43 t23 t38 t39 t40 t41 t33 t71
## b a b a b a b a a b b a b a a
## t72 t28 t25 t13 t30 t31 t5 t61 t62 t29 t89 t90 t85 t49 t36
## a a a a a a b b a a b a b a b
## t37 t26 t79 t80 t53 t54 t93 t94 t86 t45 t50 t51 t14 t58 t59
## b b a a b a a a a b b b a a a
## t7 t4 t9 t34 t35 t11 t55 t60 t65 t66
## a a a a a b a a b a
## Levels: a b
##
## [[1]][[2]]
## t26 t51 t52 t16 t91 t92 t28 t24 t43 t44 t67 t99 t100 t45 t46
## a a a a b b b b b b b b b b b
## t76 t83 t84 t74 t75 t20 t55 t56 t5 t6 t63 t64 t48 t49 t72
## a a a a a a a a b b a a a a a
## t73 t25 t36 t59 t60 t40 t41 t68 t69 t54 t47 t4 t13 t17 t18
## a b a a b a a a a a a a b a a
## t79 t87 t88 t37 t35 t8 t29 t30 t89 t90 t42 t12 t9 t38 t39
## b a a a a b b b a a b a b b b
## t14 t95 t96 t11 t31 t32 t85 t86 t19 t97 t98 t82 t53 t61 t62
## a b b b b a b b b a a a b b a
## t10 t2 t57 t58 t7 t70 t71 t3 t33 t34 t50 t77 t78 t65 t66
## a b a a b b b b b b a b b b a
## t27 t80 t81 t1 t21 t93 t94 t22 t23 t15
## b b b a b b b b a a
## Levels: a b
##
##
## [[2]]
## [[2]][[1]]
## t40 t97 t98 t66 t67 t53 t43 t80 t81 t27 t54 t55 t10 t71 t72
## b a a a a a a a a b b b a a a
## t62 t63 t41 t42 t50 t75 t85 t86 t12 t7 t60 t61 t58 t59 t19
## a b b b b b b b a a a a a a a
## t26 t46 t47 t35 t37 t91 t92 t24 t1 t20 t21 t38 t39 t87 t88
## a a a a a a a b a a a b b a a
## t28 t5 t3 t4 t73 t74 t9 t2 t36 t48 t49 t82 t83 t84 t18
## a a b a a a a a b a a a a a a
## t29 t30 t17 t51 t52 t64 t65 t22 t23 t11 t70 t76 t77 t78 t79
## a b b b b a a a b b a b a a a
## t93 t94 t14 t56 t57 t68 t69 t25 t89 t90 t34 t13 t15 t16 t31
## a b a a a a a a a b a a b b a
## t32 t33 t99 t100 t95 t96 t6 t44 t45 t8
## a a a a a a a b b a
## Levels: a b
##
## [[2]][[2]]
## t53 t54 t61 t62 t11 t8 t3 t51 t52 t88 t89 t50 t99 t100 t27
## a a a a b b b b b a a b b b b
## t17 t18 t80 t81 t69 t70 t55 t12 t6 t56 t57 t75 t76 t68 t82
## a b b b b b b a a a a a a b a
## t83 t49 t93 t94 t90 t16 t23 t59 t60 t34 t28 t29 t32 t33 t24
## a b a a a b b a a a b b b b a
## t47 t48 t21 t4 t2 t9 t25 t26 t58 t95 t96 t42 t43 t7 t66
## b b a a a b b b a a a b a a b
## t67 t77 t97 t98 t35 t36 t1 t37 t38 t22 t41 t71 t72 t46 t86
## a b a a b a a b a b b b a b b
## t87 t73 t74 t44 t45 t39 t40 t13 t91 t92 t14 t15 t64 t65 t63
## b b b b b b b a a a a b b b b
## t84 t85 t5 t10 t30 t31 t78 t79 t19 t20
## b b a a b b a a a a
## Levels: a b
##
##
## [[3]]
## [[3]][[1]]
## t56 t57 t71 t72 t97 t98 t77 t78 t4 t2 t59 t60 t39 t20 t26
## a a b a b b b b b a a a a b b
## t27 t10 t13 t14 t9 t30 t89 t90 t67 t68 t51 t15 t36 t37 t11
## a b a b b a a a b b b a b b a
## t49 t75 t76 t55 t31 t44 t45 t8 t53 t54 t48 t91 t92 t40 t41
## b a a b b a b b b b b b b a b
## t17 t46 t47 t95 t96 t29 t25 t18 t19 t6 t1 t23 t24 t73 t74
## b b a a a a a b a a a a b b b
## t7 t22 t58 t79 t80 t81 t82 t16 t52 t69 t70 t61 t62 t85 t86
## a b b b b b b b b a a b b a a
## t83 t84 t33 t34 t99 t100 t50 t32 t28 t93 t94 t35 t38 t63 t64
## b b a a b b b a b a a a a a a
## t87 t88 t42 t43 t21 t12 t5 t3 t65 t66
## a a b b b b b a a a
## Levels: a b
##
## [[3]][[2]]
## t8 t9 t58 t63 t64 t73 t92 t93 t69 t56 t57 t25 t60 t85 t86
## b b a a b b b a b b b b b b b
## t7 t65 t66 t37 t38 t2 t12 t49 t54 t55 t18 t15 t16 t36 t83
## a a b b b a b a a a a b a a a
## t84 t81 t82 t27 t89 t90 t6 t74 t75 t59 t76 t77 t47 t48 t98
## a b a a a a b a a a b b b b b
## t99 t100 t70 t94 t95 t91 t79 t80 t29 t23 t78 t96 t97 t1 t21
## b b a b b b a a a b b b b b a
## t22 t10 t43 t44 t14 t52 t53 t19 t32 t33 t17 t28 t30 t31 t39
## a b a b b b a b b a a a b a b
## t40 t71 t72 t61 t62 t20 t11 t4 t50 t51 t5 t45 t46 t26 t34
## b b b a b b a a a b b b a b a
## t35 t3 t13 t24 t41 t42 t87 t88 t67 t68
## a b b a b b b b a b
## Levels: a b
## now fit our models
fitsER<-mapply(ratebytree,trees=trees,x=x,MoreArgs=list(model="ER"),
SIMPLIFY=FALSE)
head(fitsER,n=3)
## [[1]]
## ML common-rate model:
## q k logL
## value 0.7339 1 -125.016
##
## Model fitting method was "optimize".
##
## ML multi-rate model:
## q k logL
## tree1 1.2666 2 -124.1744
## tree2 0.5602
##
## Model fitting method was "nlminb".
##
## Likelihood ratio: 1.6831
## P-value (based on X^2): 0.1945
##
##
## [[2]]
## ML common-rate model:
## q k logL
## value 0.5811 1 -117.8442
##
## Model fitting method was "optimize".
##
## ML multi-rate model:
## q k logL
## tree1 0.4412 2 -117.5178
## tree2 0.687
##
## Model fitting method was "nlminb".
##
## Likelihood ratio: 0.6527
## P-value (based on X^2): 0.4191
##
##
## [[3]]
## ML common-rate model:
## q k logL
## value 1.0253 1 -123.4968
##
## Model fitting method was "optimize".
##
## ML multi-rate model:
## q k logL
## tree1 0.8676 2 -123.228
## tree2 1.3832
##
## Model fitting method was "nlminb".
##
## Likelihood ratio: 0.5376
## P-value (based on X^2): 0.4634
## finally, pull out the P-values for each analysis:
P<-sapply(fitsER,function(x) x$P.chisq)
obj<-hist(P,breaks=seq(0,1,by=0.05),plot=F)
plot(obj$mids,obj$counts/sum(obj$counts),type="h",
lwd=18,col=make.transparent("blue",0.4),lend=1,
xlab="P-value",ylab="relative frequency",ylim=c(0,0.2))
title(main="P-values for data simulated under the null")
abline(h=0.05,col=make.transparent("red",0.5),lwd=4,
lty="dashed",lend=1)
mean(P<=0.05)
## [1] 0.06
We can see that the type I error rate is elevated above its nominal level, although just barely in this case. This should not be especially surprising as our expectation that the likelihood-ratio has a Χ2 distribution under the null is asymptotic with N, meaning that we only expect it to hold true for relatively large sample sizes. Conversely, of course, we should expect the type I error to increase for tests conducted using smaller trees (which it does - though I have not shown that here).
Next, we can examine the power of the method to detect a difference in rate between trees if one is present. Here I will simulate a transition rate q1 for tree 1 fixed at 1.0, and vary the transition rate q2 for the second tree between 1.0 and 4.0 in intervals of 0.2. Since this is a whole bunch of simulations, I'm only going to use the first 50 sets of trees from before:
trees<-trees[1:50]
q1<-1
q2<-seq(1,4,by=0.2)
foo<-function(q2,q1,trees){
Q1<-matrix(c(-q1,q1,q1,-q1),2,2,dimnames=list(letters[1:2],
letters[1:2]))
Q2<-matrix(c(-q2,q2,q2,-q2),2,2,dimnames=list(letters[1:2],
letters[1:2]))
x<-lapply(trees,function(t,Q1,Q2)
list(as.factor(sim.history(t[[1]],Q1,message=F)$states),
as.factor(sim.history(t[[2]],Q2,message=F)$states)),
Q1=Q1,Q2=Q2)
fits<-mapply(ratebytree,trees=trees,x=x,MoreArgs=list(model="ER"),
SIMPLIFY=FALSE)
sapply(fits,function(x) x$P.chisq)
}
P<-sapply(q2,foo,q1=q1,trees=trees)
power<-apply(P,2,function(x) mean(x<=0.05))
plot(power~q2,cex=1.8,bg="grey",pch=21,xlab=expression(q[2]/q[1]))
So, power increases for an increase in the difference in rate between trees 1 & 2 - just as we'd expect. Note that for this smaller set of simulations we see a more dramatically increased type I error rate under the null hypothesis of no difference in rate (q2 / q1 = 1.0).
More later.
No comments:
Post a Comment
Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.