Saturday, February 7, 2015

Reproducible analysis code for our forthcoming Anolis roosevelti (aka. locate.yeti) Evolution paper

The following is the reproducible analysis code used in our 'in press' Evolution article in which we present a method for placing recently extinct taxa on a phylogeny using continuous character data and apply it to the case of Anolis roosevelti, a giant anole from the Puerto Rican bank Virgin Islands that is most likely extinct. The paper should be out in the first half of the year, but this will be posted to Dryad soon with our data.

------------------------------

Analyis & code for “Placing cryptic, recently extinct, or hypothesized taxa into an ultrametric phylogeny using continuous character data: A case study with the lizard Anolis roosevelti

First load packages. Note that use of Rphylip also requires that the PHYLIP package be installed. This is used to compute the 'branch-score' distances of Kuhner & Felsenstein (1994).

set.seed(1)
library(phytools)
library(Rphylip)
library(phangorn)
library(clusterGeneration)

Next, here is our wrapper function for simulation & analysis:

foo<-function(N,m,nsim){
    cat("simulation conditions:\n")
    cat(paste("\tN =",N,"\n\tm =",m,"\n\tnsim=",nsim,"\n"))
    ## vectors for results
    rf<-bs<-vector()
    for(i in 1:nsim){
        ## simulate tree
        tt<-tree<-pbtree(n=N+1,tip.label=sample(c(paste("t",1:N,sep=""),"Yeti")),scale=1)
        if(m>1){
            ## generate a covariance matrix for simulation
            V<-genPositiveDefMat(m,covMethod="unifcorrmat")$Sigma
            X<-sim.corrs(tree,vcv=V)
        } else X<-as.matrix(fastBM(tree))
        ## prune unknown taxon "Yeti" from tree
        tree<-drop.tip(tree,"Yeti")
        ## run locate.yeti
        mltree<-locate.yeti(tree,X,plot=FALSE,search="exhaustive",quiet=TRUE)
        rf[i]<-RF.dist(tt,mltree)
        bs[i]<-Rtreedist(tt,trees2=mltree,quiet=TRUE)
        cat(".")
    }
    cat("\nDone.\n")
    return(list(rf=rf,bs=bs))
}

Now let's run our simulation showing the result for various numbers of taxa given a fixed number of traits:

m<-10 ## number of traits (for fixed number of traits)
nsim<-100 ## number of simulation
N<-seq(20,100,by=10)
objN<-lapply(N,foo,m=m,nsim=nsim)
## simulation conditions:
##  N = 20 
##  m = 10 
##  nsim= 100 
## ..............................................................
## Warning: running command 'rm outfile' had status 1
## ......................................
## Done.
## simulation conditions:
##  N = 30 
##  m = 10 
##  nsim= 100 
## ....................................................................................................
## Done.
## simulation conditions:
##  N = 40 
##  m = 10 
##  nsim= 100 
## .................................................................................
## Warning: running command 'rm outfile' had status 1
## ...................
## Done.
## simulation conditions:
##  N = 50 
##  m = 10 
##  nsim= 100 
## .............................................
## Warning: running command 'rm outfile' had status 1
## .......................................................
## Done.
## simulation conditions:
##  N = 60 
##  m = 10 
##  nsim= 100 
## ..........................................................................................
## Warning: running command 'rm intree' had status 1
## ..........
## Done.
## simulation conditions:
##  N = 70 
##  m = 10 
##  nsim= 100 
## .....................
## Warning: running command 'rm intree' had status 1
## ...............................................................................
## Done.
## simulation conditions:
##  N = 80 
##  m = 10 
##  nsim= 100 
## ....................................................................................................
## Done.
## simulation conditions:
##  N = 90 
##  m = 10 
##  nsim= 100 
## ............................
## Warning: running command 'rm intree2' had status 1
## ........................................................
## Warning: running command 'rm outfile' had status 1
## ................
## Done.
## simulation conditions:
##  N = 100 
##  m = 10 
##  nsim= 100 
## ..................................................................
## Warning: running command 'rm outfile' had status 1
## ..................................
## Done.

Next, we have a function to generate a null distribution of tree distances as if the missing taxon were attached randomly in our tree. Here is what that code looks like:

null.treedist<-function(N,nsim){
    tt<-pbtree(n=(N+1),nsim=nsim,tip.label=c(paste("t",1:N,sep=""),"Yeti"),scale=1)
    trees<-lapply(tt,drop.tip,tip="Yeti")
    trees<-lapply(trees,add.random,tips="Yeti")
    class(trees)<-"multiPhylo"
    rf<-mapply(RF.dist,tt,trees)
    bs<-Rtreedist(tt,trees2=trees,distances="corresponding",quiet=TRUE)
    list(rf=rf,bs=bs)
}

We can use it to recreate panels A & B of Figure 3, showing a comparison between locate.yeti and random placement of the missing tip:

## generate null trees
nullN<-lapply(N,null.treedist,nsim=nsim)
layout(matrix(c(1,2),1,2,byrow=TRUE))
par(mar=c(4.1,4.1,3.1,1.1))
## plot RF distances
nullRF<-sapply(nullN,function(x) x$rf)
colnames(nullRF)<-N
par(fg=grey(0.2,0.5))
boxplot(nullRF,col=grey(0.5,0.2),ylab="R-F distance",xlab="number of taxa (N-1)",ylim=c(0,40))
par(fg="black")
RF<-sapply(objN,function(x) x$rf)
colnames(RF)<-N
boxplot(RF,add=TRUE)
title("A                                                             ")
## plot branch-score distances
nullBS<-sapply(nullN,function(x) x$bs)
colnames(nullBS)<-N
par(fg=grey(0.2,0.5))
boxplot(nullBS,col=grey(0.5,0.2),ylab="B-S distance",xlab="number of taxa (N-1)",ylim=c(0,2.5))
par(fg="black")
BS<-sapply(objN,function(x) x$bs)
colnames(BS)<-N
boxplot(BS,add=TRUE)
title("B                                                             ")

plot of chunk unnamed-chunk-5

Now, let's do the same analysis, but this time vary the number of traits:

set.seed(1)
N<-50 ## number of taxa (for fixed number of taxa)
m<-c(1,2,5,10,20) ## number of traits
objM<-lapply(m,foo,N=N,nsim=nsim)
## simulation conditions:
##  N = 50 
##  m = 1 
##  nsim= 100 
## ....................................................................................................
## Done.
## simulation conditions:
##  N = 50 
##  m = 2 
##  nsim= 100 
## ........................................................................
## Warning: running command 'rm intree2' had status 1
## ............................
## Done.
## simulation conditions:
##  N = 50 
##  m = 5 
##  nsim= 100 
## ....................................................................
## Warning: running command 'rm outfile' had status 1
## ................................
## Done.
## simulation conditions:
##  N = 50 
##  m = 10 
##  nsim= 100 
## ......................................................
## Warning: running command 'rm outfile' had status 1
## ..............................................
## Done.
## simulation conditions:
##  N = 50 
##  m = 20 
##  nsim= 100 
## ..............................................
## Warning: running command 'rm outfile' had status 1
## .................................
## Warning: running command 'rm outfile' had status 1
## .....................
## Done.

Now we can recreate panels C & D of Figure 3, showing a comparison between locate.yeti and random placement of the missing tip, for various numbers of traits:

layout(matrix(c(1,2),1,2,byrow=TRUE))
par(mar=c(4.1,4.1,3.1,1.1))
nullM<-nullN[[which(N==50)]]
nullRF<-matrix(rep(nullM$rf,length(m)),100,length(m),byrow=FALSE)
colnames(nullRF)<-m
par(fg=grey(0.2,0.5))
boxplot(nullRF,col=grey(0.5,0.2),ylab="R-F distance",xlab="number of characters (m)",ylim=c(0,40))
par(fg="black")
RF<-sapply(objM,function(x) x$rf)
colnames(RF)<-m
boxplot(RF,add=TRUE)
title("C                                                             ")
nullBS<-matrix(rep(nullM$bs,length(m)),100,length(m),byrow=FALSE)
colnames(nullBS)<-m
par(fg=grey(0.2,0.5))
boxplot(nullBS,col=grey(0.5,0.2),ylab="B-S distance",xlab="number of characters (m)",ylim=c(0,2.5))
par(fg="black")
BS<-sapply(objM,function(x) x$bs)
colnames(BS)<-m
boxplot(BS,add=TRUE)
title("D                                                             ")

plot of chunk unnamed-chunk-7

tree<-read.tree("Revell-etal.tree.tre")
X<-read.csv("Revell-etal.data.csv",header=T,row.names=1)
X<-as.matrix(X)
X<-X[,1:20]
tip<-setdiff(rownames(X),tree$tip.label)
tip
## [1] "roosevelti"
mltree<-locate.yeti(tree,X,search="exhaustive",plot=TRUE)
## Optimizing the phylogenetic position of roosevelti using ML. Please wait....

plot of chunk unnamed-chunk-8

## Done.
mltree$logL
## [1] 3392

Now we can create the contMap style plot of Figure 4.

mltree<-reorder(reorder(mltree,"pruningwise")) ## reorder hack
pca<-phyl.pca(mltree,X)
## flip PC 1 so that it loads positively (not negatively) on size
obj<-contMap(mltree,-pca$S[,1],plot=FALSE)
class(obj)<-"densityMap"
## fix tip label
obj$tree$tip.label[which(obj$tree$tip.label=="pumilis")]<-"pumilus"
## plot
plot(obj,type="fan",lwd=4,legend=0.7,
    leg.txt=c(round(obj$lims[1],3),"PC 1",round(obj$lims[2],3)),
    outline=TRUE)
## add arrows
add.arrow(tree=obj$tree,tip="roosevelti",col="red",lwd=3,hedl=0.06,angle=50)
add.arrow(tree=obj$tree,tip="cuvieri",col="blue",lwd=3,hedl=0.06,angle=50)

plot of chunk unnamed-chunk-9

Finally, we can do the analyses to test the (null) hypotheses that A. roosevelti is siste to A. cuvieri, or sister-to or nested-within the majority of the other Puerto Rican anoles:

## find ML cuvieri-constraint tree
cuvieri<-which(tree$tip.label=="cuvieri")
cuvieri.tree<-locate.yeti(tree,X,search="exhaustive",constraint=cuvieri)
## Optimizing the phylogenetic position of roosevelti using ML. Please wait....
## Done.
## compute a LR to test the alternative hypothesis that the A. roosevelti
## is *not* sister
LR.cuvieri<-2*(mltree$logL-cuvieri.tree$logL)
## perform simulation for the null distribution of the LR
set.seed(1)
LR.null<-vector()
nsim<-100
obj<-phyl.vcv(X[cuvieri.tree$tip.label,],vcv(cuvieri.tree),lambda=1)
for(i in 1:nsim){
    Xsim<-sim.corrs(tree=cuvieri.tree,vcv=obj$R,anc=obj$alpha[,1])
    mltree.null<-locate.yeti(tree,Xsim,search="exhaustive",plot=FALSE,
        quiet=TRUE)
    cuvieri.tree.null<-locate.yeti(tree,Xsim,search="exhaustive",
        constraint=which(tree$tip.label=="cuvieri"),plot=FALSE,
        quiet=TRUE)
    LR.null[i]<-2*(mltree.null$logL-cuvieri.tree.null$logL)
}
P.cuvieri<-1-mean(LR.cuvieri>=LR.null)
## P-value of the LR test
P.cuvieri
## [1] 0.26

This suggests that we cannot reject the hull hypothesis that A. roosevelti is sister to A. cuvieri.

Now, let's do the same thing, but set our constraint tree to be one in which Anolis roosevelti must be found nested within or sister to the clade with most of Puerto Rican's remaining anole species:

sp<-c("cristatellus","cooki","poncensis","gundlachi","pulchellus","krugi",
    "stratulus","evermanni")
pr<-getDescendants(tree,findMRCA(tree,sp))
pr.tree<-locate.yeti(tree,X,search="exhaustive",constraint=pr)
## Optimizing the phylogenetic position of roosevelti using ML. Please wait....
## Done.
LR.pr<-2*(mltree$logL-pr.tree$logL)
LR.null<-vector()
nsim<-100
obj<-phyl.vcv(X[pr.tree$tip.label,],vcv(pr.tree),lambda=1)
for(i in 1:nsim){
    Xsim<-sim.corrs(tree=pr.tree,vcv=obj$R,anc=obj$alpha[,1])
    mltree.null<-locate.yeti(tree,Xsim,search="exhaustive",plot=FALSE,
        quiet=TRUE)
    pr.tree.null<-locate.yeti(tree,Xsim,search="exhaustive",
        constraint=getDescendants(tree,findMRCA(tree,sp)),
        plot=FALSE,quiet=TRUE)
    LR.null[i]<-2*(mltree.null$logL-pr.tree.null$logL)
}
P.pr<-1-mean(LR.pr>=LR.null)
P.pr
## [1] 0.02

Here, by contrast, we can confidently reject the hypothesis that A. roosevelti is part of the main clade of Puerto Rican anoles.

That's pretty much it.

Last updated Jan. 13, 2015.

3 comments:

  1. Nice work! From simulated data and from what RG Reynolds commented, I see that locate.yeti doesn't work with less than 30 tips. Is it an optimization issue or the lack of phylogenetic structure?

    ReplyDelete
    Replies
    1. I wouldn't say that it "doesn't work". In fact, if you look at the panel figures A & B you can see that for 20 taxa the performance is still vastly better than adding the tip at random! I would just say that power declines with fewer taxa in the tree....

      Delete
  2. Thanks Liam. "Doesn't work" was poor wording on my part, sorry. I finally got around to trying out the function with simulated data and whenever I use a tree with ten tips or less I get warnings:

    "In sqrt(V[i, i] * result$Eval[j, j]) : NaNs produced"

    and the missing tip always gets placed as an outgroup. So far I haven't figured out what is causing those warnings.

    Since your talk in Spain I measured almost 30 cranial characters for a genus of rodents with 8 species, including specimens thought to be a cryptic species, and I'm getting the same behaviour. I just wasn't sure if it was power declining with tree size or something else that has to do with the warnings produced with small trees.
    cheers
    Ld

    ReplyDelete