Saturday, January 19, 2019

MCCR test now in phytools

Yesterday I posted about implementing Pybus & Harvey's (2000) MCCR test. This used to be in Dan Rabosky's laser package, which is no longer available on CRAN (except in archived versions). The method is quite simple: it uses Monte Carlo simulation to generate a null distribution of the γ statistic for incompletely sampled phylogenies. I have now added this function (and a couple of associated methods) to phytools. The update can be obtained by installing the latest version of phytools from GitHub using devtools.
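
For reference, the γ statistic of Pybus & Harvey (2000) can be written as below. The notation is supplied here and does not appear in the post: n is the number of tips, g_k is the length of the interval during which the tree has exactly k lineages, and T is the weighted sum of all intervals.

```latex
\gamma = \frac{\dfrac{1}{n-2}\displaystyle\sum_{i=2}^{n-1}\left(\sum_{k=2}^{i} k\,g_k\right) - \dfrac{T}{2}}
              {T\sqrt{\dfrac{1}{12\,(n-2)}}},
\qquad T = \sum_{j=2}^{n} j\,g_j
```

Under a constant-rate pure-birth process with complete sampling, γ follows (approximately) a standard normal distribution, which is presumably what the parametric P-value relies on. Dropping tips at random pushes γ downward, hence the need for a simulated null distribution.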

Here's a quick demo of how the function works:

library(phytools)
packageVersion("phytools")
## [1] '0.6.66'
tree<-pbtree(n=600,scale=100)
ltt(tree,show.tree=TRUE,lwd=3)

## Object of class "ltt" containing:
## 
## (1) A phylogenetic tree with 600 tips and 599 internal nodes.
## 
## (2) Vectors containing the number of lineages (ltt) and branching times (times) on the tree.
## 
## (3) A value for Pybus & Harvey's "gamma" statistic of -0.1158, p-value = 0.9078.

## subsample the tree to introduce a downward bias in gamma:
incomplete<-drop.tip(tree,sample(tree$tip.label,400))
ltt(incomplete,show.tree=TRUE,lwd=3)

## Object of class "ltt" containing:
## 
## (1) A phylogenetic tree with 200 tips and 199 internal nodes.
## 
## (2) Vectors containing the number of lineages (ltt) and branching times (times) on the tree.
## 
## (3) A value for Pybus & Harvey's "gamma" statistic of -4.4687, p-value = 0.

## MCCR test to take into account incomplete sampling:
object<-ltt(incomplete,plot=FALSE)
mccr.result<-mccr(object,rho=200/600,nsim=500)
mccr.result
## Object of class "mccr" consisting of:
## 
## (1) A value for Pybus & Harvey's "gamma" statistic of -4.4687.
## 
## (2) A two-tailed p-value from the MCCR test of 0.92.
## 
## (3) A simulated null-distribution of gamma from 500 simulations.
plot(mccr.result)
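
The role of rho in the call above can be worked out by hand. This is a minimal sketch in base R, assuming (as in the mccr function posted the day before) that the size of each simulated tree is the observed number of tips divided by rho, rounded; the variable names n.obs and n.drop are only for illustration.

```r
## number of tips in the observed (incompletely sampled) tree
n.obs <- 200
## assumed sampling fraction: 200 of 600 species sampled
rho <- 200/600

## size of each simulated, completely sampled pure-birth tree
N <- round(n.obs / rho)
## number of tips pruned at random from each simulated tree
n.drop <- N - n.obs

c(N = N, n.drop = n.drop)
```

So each null tree mimics the real sampling design: it is grown to the assumed full size and then pruned at random down to the number of tips actually observed.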

We can also do a small experiment to show that the MCCR test has the correct type I error rate. Here I'm going to simulate 500 trees of 150 taxa each, & then randomly subsample each one to include only 100 taxa. For each replicate I'll compute γ & a P-value for the null hypothesis that γ is equal to zero on the original tree; the same two quantities on the subsampled tree; and finally a P-value for the subsampled tree based on the MCCR method:

foo<-function(){
    full<-pbtree(n=150)
    full.ltt<-ltt(full,plot=FALSE)
    incomplete<-drop.tip(full,sample(full$tip.label,50))
    incomplete.ltt<-ltt(incomplete,plot=FALSE)
    setNames(c(full.ltt$gamma,full.ltt$p,
        incomplete.ltt$gamma,incomplete.ltt$p,
        mccr(incomplete.ltt,rho=100/150,nsim=200)$'P(two-tailed)'),
        c("full-gamma","full-P(gamma)","inc-gamma",
        "inc-P(gamma)","P(mccr)"))
}
sim.Result<-t(replicate(500,foo()))
bb<-seq(0,1,by=0.05)
par(mfrow=c(3,1))
obj<-hist(sim.Result[,2],bb,plot=FALSE)
obj$counts<-obj$counts/nrow(sim.Result)
plot(obj,ylab="relative frequency",col="grey",ylim=c(0,0.4),xlab="P-value",main="")
lines(c(0,1),rep(0.05,2),col="blue",lty="dotted")
mtext(text="a) P-value distribution on complete trees",adj=0,line=1,
    cex=1)
obj<-hist(sim.Result[,4],bb,plot=FALSE)
obj$counts<-obj$counts/nrow(sim.Result)
plot(obj,ylab="relative frequency",col="grey",ylim=c(0,0.4),xlab="P-value",main="")
lines(c(0,1),rep(0.05,2),col="blue",lty="dotted")
mtext(text="b) P-value distribution on incompletely sampled trees",adj=0,line=1,
    cex=1)
obj<-hist(sim.Result[,5],bb,plot=FALSE)
obj$counts<-obj$counts/nrow(sim.Result)
plot(obj,ylab="relative frequency",col="grey",ylim=c(0,0.4),xlab="P-value",main="")
lines(c(0,1),rep(0.05,2),col="blue",lty="dotted")
mtext(text="c) P-value distribution on incompletely sampled trees using MCCR test",
    adj=0,line=1,cex=1)

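[Figure: histograms of P-values for a) complete trees, b) incompletely sampled trees, and c) incompletely sampled trees using the MCCR test; the dotted line marks a relative frequency of 0.05.]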

We can see that the parametric test has good type I error on completely sampled trees, but elevated type I error on incompletely sampled trees. The MCCR test recovers the good type I error of the parametric test when our taxon sampling is incomplete. We can also compute the type I error rates directly as follows:

typeI<-setNames(c(mean(sim.Result[,2]<=0.05),
    mean(sim.Result[,4]<=0.05),mean(sim.Result[,5]<=0.05)),
    c("complete","incomplete","MCCR"))
typeI
##   complete incomplete       MCCR 
##      0.044      0.190      0.052
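
To judge whether these rates differ from the nominal 0.05 by more than simulation noise, here is a rough check in base R. It is only a sketch: it plugs in the three rates printed above and uses the binomial standard error of a proportion estimated from 500 replicates; the names nrep, alpha, se, and z are introduced here for illustration.

```r
## type I error rates reported above, from 500 simulated data sets
typeI <- c(complete = 0.044, incomplete = 0.190, MCCR = 0.052)
nrep <- 500
alpha <- 0.05

## standard error of a rejection rate estimated from nrep replicates
## when the true rejection rate is alpha
se <- sqrt(alpha * (1 - alpha) / nrep)

## distance of each estimate from alpha, in standard-error units
z <- (typeI - alpha) / se
round(z, 2)
```

The standard error is a little under 0.01, so the rates for the complete trees and for the MCCR test sit within one standard error of 0.05, whereas the rate for the uncorrected test on subsampled trees is far outside that range.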

Neat.

Friday, January 18, 2019

MCCR test for Pybus & Harvey's γ on incompletely sampled trees using phytools

The MCCR test for Pybus & Harvey's (2000) γ statistic from incompletely sampled phylogenies used to be implemented in the no-longer-available-on-CRAN laser package of Dan Rabosky. The method is pretty simple, though, & since we use it in teaching I figured it would be easy enough to add to phytools.
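
For readers who want to see what γ measures, here is a minimal base R sketch that computes it from internode intervals rather than from a tree object. The function name gamma.stat and the toy interval vectors are hypothetical; in practice the value comes from ltt in phytools (or gammaStat in ape).

```r
## gamma from internode intervals: g[1] is the time during which the
## tree has 2 lineages, g[2] the time with 3 lineages, & so on up to n
gamma.stat <- function(g) {
    n <- length(g) + 1          # number of tips
    stopifnot(n >= 3)
    kg <- (2:n) * g             # intervals weighted by lineage number
    total <- sum(kg)
    ## partial sums up to the interval with n-1 lineages
    partial <- cumsum(kg)[seq_len(n - 2)]
    (mean(partial) - total / 2) / (total * sqrt(1 / (12 * (n - 2))))
}

## weighted intervals all equal: gamma is zero
gamma.stat(1 / (2:10))
## long early interval, so nodes are pushed towards the tips: gamma > 0
gamma.stat(c(10, 1, 1, 1))
## long final interval, so nodes are pushed towards the root: gamma < 0
gamma.stat(c(1, 1, 1, 10))
```

Random pruning of tips preferentially removes recent nodes, which is why incompletely sampled trees tend to give negative values of γ even under pure birth.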

Here is a function to do the test, as well as print & plot methods for the resultant object:

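mccr<-function(obj,rho=1,nsim=100,plot=TRUE){
    ## obj is an object of class "ltt"; rho is the sampling fraction
    ## (the plot argument is accepted but not currently used)
    ## number of tips in the completely sampled tree:
    N<-round(Ntip(obj$tree)/rho)
    ## simulate nsim pure-birth trees with N tips:
    tt<-pbtree(n=N,nsim=nsim)
    ## prune each tree at random to the observed number of tips:
    foo<-function(x,m) drop.tip(x,sample(x$tip.label,m))
    tt<-lapply(tt,foo,m=N-Ntip(obj$tree))
    ## null distribution of gamma:
    g<-sapply(tt,function(x) ltt(x,plot=FALSE)$gamma)
    ## two-tailed P-value: twice the tail on the observed side of the
    ## median, capped at 1
    P<-if(obj$gamma>median(g)) 2*mean(g>=obj$gamma) else 2*mean(g<=obj$gamma)
    P<-min(1,P)
    result<-list(gamma=obj$gamma,"P(two-tailed)"=P,null.gamma=g)
    class(result)<-"mccr"
    result
}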

print.mccr<-function(x,digits=4,...){
    cat("Object of class \"mccr\" consisting of:\n\n")
    cat(paste("(1) A value for Pybus & Harvey's \"gamma\"",
        " statistic of ",round(x$gamma,digits),".\n\n",sep=""))
    cat(paste("(2) A two-tailed p-value from the MCCR test of ",
        round(x$'P(two-tailed)',digits),".\n\n", sep = ""))
    cat(paste("(3) A simulated null-distribution of gamma from ",
        length(x$null.gamma)," simulations.\n\n",sep=""))
}

plot.mccr<-function(x,...){
    hist(x$null.gamma,breaks=min(c(max(12,round(length(x$null.gamma)/10)),20)),
        xlim=range(c(x$gamma,x$null.gamma)),
        main=expression(paste("null distribution of ",
        gamma)),xlab=expression(gamma),col="lightgrey")
    arrows(x0=x$gamma,y0=par()$usr[4],y1=0,length=0.12,
        col=make.transparent("blue",0.5),lwd=2)
    text(x$gamma,0.98*par()$usr[4],
        expression(paste("observed value of ",gamma)),
        pos=if(x$gamma>mean(x$null.gamma)) 2 else 4)
}
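
To make the P-value step in mccr concrete, here is a minimal base R sketch of the same two-tailed calculation applied to a made-up null distribution. The helper name mc.two.tailed and the toy numbers are hypothetical, and the cap at 1 is a guard for the case in which the observed value sits at the median of the null.

```r
## two-tailed Monte Carlo P-value from an observed statistic & a
## vector of values simulated under the null
mc.two.tailed <- function(observed, null) {
    P <- if (observed > median(null)) {
        2 * mean(null >= observed)
    } else {
        2 * mean(null <= observed)
    }
    min(1, P)
}

## toy null distribution: ten evenly spaced values with median 0
null <- seq(-4.5, 4.5, by = 1)

mc.two.tailed(3.2, null)   # 2 of 10 null values are >= 3.2
mc.two.tailed(-4, null)    # 1 of 10 null values is <= -4
```

Doubling the one-sided tail makes the test symmetric around the median of the simulated values rather than around zero, which is what matters when incomplete sampling has shifted the whole null distribution downward.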

Now let's see how to use it. First, I'll run it on a completely sampled tree:

## pure birth tree with complete sampling:
pb<-pbtree(n=100)
lineages<-ltt(pb)

lineages
## Object of class "ltt" containing:
## 
## (1) A phylogenetic tree with 100 tips and 99 internal nodes.
## 
## (2) Vectors containing the number of lineages (ltt) and branching times (times) on the tree.
## 
## (3) A value for Pybus & Harvey's "gamma" statistic of -0.2688, p-value = 0.7881.

## MCCR test assuming complete sampling:
object<-mccr(lineages,rho=1,nsim=200)
object
## Object of class "mccr" consisting of:
## 
## (1) A value for Pybus & Harvey's "gamma" statistic of -0.2688.
## 
## (2) A two-tailed p-value from the MCCR test of 0.67.
## 
## (3) A simulated null-distribution of gamma from 200 simulations.
plot(object)

A neat thing to show is that the MCCR test (based on simulation) will give a P-value that converges on the P-value from the parametric test if you run enough simulations. Let's see that:

object<-mccr(lineages,rho=1,nsim=10000)
lineages
## Object of class "ltt" containing:
## 
## (1) A phylogenetic tree with 100 tips and 99 internal nodes.
## 
## (2) Vectors containing the number of lineages (ltt) and branching times (times) on the tree.
## 
## (3) A value for Pybus & Harvey's "gamma" statistic of -0.2688, p-value = 0.7881.
object
## Object of class "mccr" consisting of:
## 
## (1) A value for Pybus & Harvey's "gamma" statistic of -0.2688.
## 
## (2) A two-tailed p-value from the MCCR test of 0.7942.
## 
## (3) A simulated null-distribution of gamma from 10000 simulations.
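
The agreement between the two P-values above can be explored without simulating any trees. The sketch below is an illustration under two assumptions: that the parametric P-value treats γ as standard normal under pure birth, and that draws from rnorm can stand in for the γ values of simulated, completely sampled trees. The helper mc.p is hypothetical.

```r
## observed gamma from the completely sampled tree above
gamma.obs <- -0.2688

## parametric two-tailed P-value, assuming gamma ~ N(0,1)
p.param <- 2 * pnorm(-abs(gamma.obs))

## Monte Carlo P-value; rnorm() stands in for the gamma values that
## would be obtained from nsim simulated pure-birth trees
mc.p <- function(nsim, gamma) {
    g <- rnorm(nsim)
    if (gamma > median(g)) 2 * mean(g >= gamma) else 2 * mean(g <= gamma)
}

set.seed(1)
nsims <- c(200, 10000, 100000)
p.mc <- sapply(nsims, mc.p, gamma = gamma.obs)
round(c(parametric = p.param, setNames(p.mc, paste0("nsim=", nsims))), 4)
```

The parametric value should come out very close to the 0.7881 printed by ltt, and the Monte Carlo estimates scatter around it with an error that shrinks roughly in proportion to one over the square root of nsim.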

Next, we can simulate incomplete sampling, see the effect on the P-value from ltt, and then compare that to a P-value obtained from the MCCR test:

incomplete.pb<-drop.tip(pb,sample(pb$tip.label,round(0.5*Ntip(pb))))
lineages<-ltt(incomplete.pb,plot=FALSE)
lineages
## Object of class "ltt" containing:
## 
## (1) A phylogenetic tree with 50 tips and 49 internal nodes.
## 
## (2) Vectors containing the number of lineages (ltt) and branching times (times) on the tree.
## 
## (3) A value for Pybus & Harvey's "gamma" statistic of -0.9876, p-value = 0.3234.
object<-mccr(lineages,rho=0.5,nsim=200)
object
## Object of class "mccr" consisting of:
## 
## (1) A value for Pybus & Harvey's "gamma" statistic of -0.9876.
## 
## (2) A two-tailed p-value from the MCCR test of 0.76.
## 
## (3) A simulated null-distribution of gamma from 200 simulations.

Finally, for fun we can simulate a tree with a shape that should correspond to a positive value of γ. Here, I'll do it using the phytools internal function ebTree. We can then subsample the tree to simulate incomplete sampling.

eb<-phytools:::ebTree(pbtree(n=200,scale=1),-1.2)
ltt(eb)

## Object of class "ltt" containing:
## 
## (1) A phylogenetic tree with 200 tips and 199 internal nodes.
## 
## (2) Vectors containing the number of lineages (ltt) and branching times (times) on the tree.
## 
## (3) A value for Pybus & Harvey's "gamma" statistic of 2.0067, p-value = 0.0448.
incomplete.eb<-drop.tip(eb,sample(eb$tip.label,round(0.5*Ntip(eb))))
ltt(incomplete.eb)

## Object of class "ltt" containing:
## 
## (1) A phylogenetic tree with 100 tips and 99 internal nodes.
## 
## (2) Vectors containing the number of lineages (ltt) and branching times (times) on the tree.
## 
## (3) A value for Pybus & Harvey's "gamma" statistic of -0.2614, p-value = 0.7938.
object<-mccr(ltt(incomplete.eb,plot=FALSE),rho=0.5,nsim=200)
object
## Object of class "mccr" consisting of:
## 
## (1) A value for Pybus & Harvey's "gamma" statistic of -0.2614.
## 
## (2) A two-tailed p-value from the MCCR test of 0.06.
## 
## (3) A simulated null-distribution of gamma from 200 simulations.
plot(object)

Neat.

This is not yet in phytools, but I will add documentation & put it in a future release.