Friday, August 29, 2014

Counting edges from a Newick string

I just posted a new phytools source build (phytools 0.4-32 featuring the newly updated read.newick function. read.newick is a simple tree-reading function - basically redundant with read.tree in the ape package, but slightly more tolerant of "badly conformed" Newick strings. Unfortunately, it is quite slow - although some improvements to the code have dramatically decreased run-time for the function.

At some point I will have to add these improvements to phytools read.simmap as it uses the same basic structure.

install.packages("phytools_0.4-32.tar.gz",type="source",repos=NULL)
## Installing package into 'C:/Users/liam.revell/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
library(phytools)
## Loading required package: ape
## Loading required package: maps

One of the tricks I employed to get read.newick to run faster was to figure out how many edges the tree should have before beginning to parse the Newick string. This turns out to involve simply counting the number of left parentheses (")"; or right parentheses as these will be the same) and the number of commas (",") and summing them. I'm a little embarassed to admit that the way I figured this out was by manually counting the number of commas and parentheses in several trees, setting up a system of linear equations, and solving for the coefficients; however in retrospect it is obvious: all edges in a tree written in Newick format (except for the root edge, which we actually don't care about anyway as it is a separate item in our "phylo" object anyway) have to be followed by either a "," or a ")".

So, for instance:

trees<-allFurcTrees(n=5)
trees
## 26 phylogenetic trees
nedges<-sapply(trees,function(x) nrow(x$edge))
nedges
##  [1] 7 7 7 7 7 6 6 7 7 7 7 7 6 6 7 7 7 7 7 6 6 6 6 6 6 5
## now let's create the Newick strings
obj<-write.tree(trees)
obj
##  [1] "(((3,4),2),1,5);" "(((3,2),4),1,5);" "((3,(4,2)),1,5);"
##  [4] "((3,4),(1,2),5);" "((3,4),1,(5,2));" "((3,4),1,5,2);"  
##  [7] "((2,3,4),1,5);"   "((3,2),(1,4),5);" "(3,((1,4),2),5);"
## [10] "(3,((1,2),4),5);" "(3,(1,(4,2)),5);" "(3,(1,4),(5,2));"
## [13] "(3,(1,4),5,2);"   "(3,(2,1,4),5);"   "((3,2),1,(5,4));"
## [16] "(3,(1,2),(5,4));" "(3,1,((5,4),2));" "(3,1,((5,2),4));"
## [19] "(3,1,(5,(4,2)));" "(3,1,(5,4),2);"   "(3,1,(2,5,4));"  
## [22] "((3,2),1,5,4);"   "(3,(1,2),5,4);"   "(3,1,(5,2),4);"  
## [25] "(3,1,5,(4,2));"   "(3,1,5,4,2);"
## count commas
ncommas<-sapply(obj,function(x) sum(strsplit(x,"")[[1]]==","))
## count right brackets
nbrackets<-sapply(obj,function(x) sum(strsplit(x,"")[[1]]==")"))
ncommas+nbrackets
## (((3,4),2),1,5); (((3,2),4),1,5); ((3,(4,2)),1,5); ((3,4),(1,2),5); 
##                7                7                7                7 
## ((3,4),1,(5,2));   ((3,4),1,5,2);   ((2,3,4),1,5); ((3,2),(1,4),5); 
##                7                6                6                7 
## (3,((1,4),2),5); (3,((1,2),4),5); (3,(1,(4,2)),5); (3,(1,4),(5,2)); 
##                7                7                7                7 
##   (3,(1,4),5,2);   (3,(2,1,4),5); ((3,2),1,(5,4)); (3,(1,2),(5,4)); 
##                6                6                7                7 
## (3,1,((5,4),2)); (3,1,((5,2),4)); (3,1,(5,(4,2)));   (3,1,(5,4),2); 
##                7                7                7                6 
##   (3,1,(2,5,4));   ((3,2),1,5,4);   (3,(1,2),5,4);   (3,1,(5,2),4); 
##                6                6                6                6 
##   (3,1,5,(4,2));     (3,1,5,4,2); 
##                6                5
plot(nedges[order(nedges)],(ncommas+nbrackets)[order(nedges)],type="b")

plot of chunk unnamed-chunk-2

This holds regardless of the tree size:

## n=6
trees<-allFurcTrees(n=6)
trees
## 236 phylogenetic trees
nedges<-sapply(trees,function(x) nrow(x$edge))
obj<-write.tree(trees)
ncommas<-sapply(obj,function(x) sum(strsplit(x,"")[[1]]==","))
nbrackets<-sapply(obj,function(x) sum(strsplit(x,"")[[1]]==")"))
plot(nedges[order(nedges)],(ncommas+nbrackets)[order(nedges)],type="b")

plot of chunk unnamed-chunk-3

## n=7
trees<-allFurcTrees(n=7)
trees
## 2752 phylogenetic trees
nedges<-sapply(trees,function(x) nrow(x$edge))
obj<-write.tree(trees)
ncommas<-sapply(obj,function(x) sum(strsplit(x,"")[[1]]==","))
nbrackets<-sapply(obj,function(x) sum(strsplit(x,"")[[1]]==")"))
plot(nedges[order(nedges)],(ncommas+nbrackets)[order(nedges)],type="b")

plot of chunk unnamed-chunk-3

## n=8
trees<-allFurcTrees(n=8)
trees
## 39208 phylogenetic trees
nedges<-sapply(trees,function(x) nrow(x$edge))
obj<-write.tree(trees)
ncommas<-sapply(obj,function(x) sum(strsplit(x,"")[[1]]==","))
nbrackets<-sapply(obj,function(x) sum(strsplit(x,"")[[1]]==")"))
plot(nedges[order(nedges)],(ncommas+nbrackets)[order(nedges)],type="b")

plot of chunk unnamed-chunk-3

Well, that's all I have to say about this.

Thursday, August 28, 2014

Faster version of read.newick

I just posted a new version of read.newick that is substantially faster than prior versions. It is still way slower than read.tree in the ape package; however read.newick may be able to read some Newick strings that are considered badly conformed by read.tree.

First, to establish that the existing version of read.newick is quite slow:

library(phytools)
packageVersion("phytools")
## [1] '0.4.31'
## simulate a tree
tree<-pbtree(n=10000)
## write to a text string
text<-write.tree(tree)
## read into memory using read.newick
system.time(t1<-read.newick(text=text))
##    user  system elapsed 
##   15.85    0.00   15.85
all.equal.phylo(tree,t1)
## [1] TRUE
## by comparison, read.tree
t2<-read.tree(text=text)

So, read.newick is pretty damn slow - but it does work.

We can dig a little deeper to try & figure out precisely what causes read.newick to be so slow using Rprof:

Rprof(tmp<-tempfile(),memory.profiling=TRUE)
t1<-read.newick(text=text)
Rprof()
summaryRprof(tmp,memory="both")
## $by.self
##                 self.time self.pct total.time total.pct mem.total
## "match"             10.98    64.74      11.12     65.57     844.0
## "rbind"              2.58    15.21       2.58     15.21     197.2
## "newick"             1.46     8.61      16.96    100.00    1311.5
## "getEdgeLength"      1.04     6.13       1.74     10.26     130.0
## "getLabel"           0.24     1.42       0.56      3.30      16.0
## "paste"              0.24     1.42       0.24      1.42       3.2
## "+"                  0.10     0.59       0.10      0.59       9.4
## "c"                  0.10     0.59       0.10      0.59       5.5
## "=="                 0.06     0.35       0.06      0.35       0.9
## "as.numeric"         0.04     0.24       0.04      0.24       0.4
## "is.na"              0.04     0.24       0.04      0.24       7.9
## "<"                  0.02     0.12       0.02      0.12       0.0
## ">"                  0.02     0.12       0.02      0.12       3.2
## "strsplit"           0.02     0.12       0.02      0.12       0.0
## "vector"             0.02     0.12       0.02      0.12       3.2
## 
## $by.total
##                       total.time total.pct mem.total self.time self.pct
## "newick"                   16.96    100.00    1311.5      1.46     8.61
## "block_exec"               16.96    100.00    1311.5      0.00     0.00
## "call_block"               16.96    100.00    1311.5      0.00     0.00
## "doTryCatch"               16.96    100.00    1311.5      0.00     0.00
## "eval"                     16.96    100.00    1311.5      0.00     0.00
## "evaluate"                 16.96    100.00    1311.5      0.00     0.00
## "evaluate_call"            16.96    100.00    1311.5      0.00     0.00
## "handle"                   16.96    100.00    1311.5      0.00     0.00
## "in_dir"                   16.96    100.00    1311.5      0.00     0.00
## "knit"                     16.96    100.00    1311.5      0.00     0.00
## "knit2html"                16.96    100.00    1311.5      0.00     0.00
## "process_file"             16.96    100.00    1311.5      0.00     0.00
## "process_group"            16.96    100.00    1311.5      0.00     0.00
## "process_group.block"      16.96    100.00    1311.5      0.00     0.00
## "read.newick"              16.96    100.00    1311.5      0.00     0.00
## "try"                      16.96    100.00    1311.5      0.00     0.00
## "tryCatch"                 16.96    100.00    1311.5      0.00     0.00
## "tryCatchList"             16.96    100.00    1311.5      0.00     0.00
## "tryCatchOne"              16.96    100.00    1311.5      0.00     0.00
## "withCallingHandlers"      16.96    100.00    1311.5      0.00     0.00
## "withVisible"              16.96    100.00    1311.5      0.00     0.00
## "match"                    11.12     65.57     844.0     10.98    64.74
## "rbind"                     2.58     15.21     197.2      2.58    15.21
## "getEdgeLength"             1.74     10.26     130.0      1.04     6.13
## "getLabel"                  0.56      3.30      16.0      0.24     1.42
## "paste"                     0.24      1.42       3.2      0.24     1.42
## "+"                         0.10      0.59       9.4      0.10     0.59
## "c"                         0.10      0.59       5.5      0.10     0.59
## "=="                        0.06      0.35       0.9      0.06     0.35
## "as.numeric"                0.04      0.24       0.4      0.04     0.24
## "is.na"                     0.04      0.24       7.9      0.04     0.24
## "<"                         0.02      0.12       0.0      0.02     0.12
## ">"                         0.02      0.12       3.2      0.02     0.12
## "strsplit"                  0.02      0.12       0.0      0.02     0.12
## "vector"                    0.02      0.12       3.2      0.02     0.12
## "unlist"                    0.02      0.12       0.0      0.00     0.00
## 
## $sample.interval
## [1] 0.02
## 
## $sampling.time
## [1] 16.96
unlink(tmp)

This tells us that our main weak points are calls to match and rbind. match is used multiple times throughout read.newick, but what is causing trouble is use of match to traverse the tree towards the root. rbind is being used to build our matrix edge because we do not know how big it ought to be ahead of time.

So, what can we do? Well my solutions (posted here), which seem fairly obvious in retrospect is to: (1) store the indices of each row for each node as the tree is built; and (2) check how big edge needs to be before we start. (I.e., how many edges there will be in our tree.) This, it turns out, is just the sum of the number of "(" (or ")", these are the same) and "," in our original Newick string.

Let's try it:

source("read.newick.R")
Rprof(tmp<-tempfile(),memory.profiling=TRUE)
t3<-read.newick(text=text)
Rprof()
summaryRprof(tmp,memory="both")
## $by.self
##                 self.time self.pct total.time total.pct mem.total
## "newick"             1.58    51.97       3.04    100.00     299.7
## "match"              0.68    22.37       0.68     22.37      49.9
## "paste"              0.30     9.87       0.32     10.53      28.8
## "getEdgeLength"      0.24     7.89       1.12     36.84     100.1
## "is.na"              0.06     1.97       0.06      1.97       8.3
## "getLabel"           0.04     1.32       0.20      6.58       4.2
## "as.numeric"         0.04     1.32       0.04      1.32       4.5
## "!="                 0.02     0.66       0.02      0.66       0.0
## ":"                  0.02     0.66       0.02      0.66       0.0
## "+"                  0.02     0.66       0.02      0.66       0.0
## "list"               0.02     0.66       0.02      0.66       0.0
## "strsplit"           0.02     0.66       0.02      0.66       0.0
## 
## $by.total
##                       total.time total.pct mem.total self.time self.pct
## "newick"                    3.04    100.00     299.7      1.58    51.97
## "block_exec"                3.04    100.00     299.7      0.00     0.00
## "call_block"                3.04    100.00     299.7      0.00     0.00
## "doTryCatch"                3.04    100.00     299.7      0.00     0.00
## "eval"                      3.04    100.00     299.7      0.00     0.00
## "evaluate"                  3.04    100.00     299.7      0.00     0.00
## "evaluate_call"             3.04    100.00     299.7      0.00     0.00
## "handle"                    3.04    100.00     299.7      0.00     0.00
## "in_dir"                    3.04    100.00     299.7      0.00     0.00
## "knit"                      3.04    100.00     299.7      0.00     0.00
## "knit2html"                 3.04    100.00     299.7      0.00     0.00
## "process_file"              3.04    100.00     299.7      0.00     0.00
## "process_group"             3.04    100.00     299.7      0.00     0.00
## "process_group.block"       3.04    100.00     299.7      0.00     0.00
## "read.newick"               3.04    100.00     299.7      0.00     0.00
## "try"                       3.04    100.00     299.7      0.00     0.00
## "tryCatch"                  3.04    100.00     299.7      0.00     0.00
## "tryCatchList"              3.04    100.00     299.7      0.00     0.00
## "tryCatchOne"               3.04    100.00     299.7      0.00     0.00
## "withCallingHandlers"       3.04    100.00     299.7      0.00     0.00
## "withVisible"               3.04    100.00     299.7      0.00     0.00
## "getEdgeLength"             1.12     36.84     100.1      0.24     7.89
## "match"                     0.68     22.37      49.9      0.68    22.37
## "paste"                     0.32     10.53      28.8      0.30     9.87
## "getLabel"                  0.20      6.58       4.2      0.04     1.32
## "is.na"                     0.06      1.97       8.3      0.06     1.97
## "as.numeric"                0.04      1.32       4.5      0.04     1.32
## "!="                        0.02      0.66       0.0      0.02     0.66
## ":"                         0.02      0.66       0.0      0.02     0.66
## "+"                         0.02      0.66       0.0      0.02     0.66
## "list"                      0.02      0.66       0.0      0.02     0.66
## "strsplit"                  0.02      0.66       0.0      0.02     0.66
## "unlist"                    0.02      0.66       0.0      0.00     0.00
## 
## $sample.interval
## [1] 0.02
## 
## $sampling.time
## [1] 3.04
unlink(tmp)
all.equal.phylo(tree,t3)
## [1] TRUE

That's it.

Tuesday, August 26, 2014

New version of phytools submitted to CRAN

I just submitted a new version of phytools (phytools 0.4-31) to CRAN. Hopefully it will be accepted and post within 24 hrs.

Some updates in this version include the following:

1. An important bug fix in the phytools function make.simmap for multi-state data.

2. A new version of the phytools 'robust' tree reader, read.newick, that permits a root edge length.

3. New user controls for the function phylo.to.map for projecting a phylogeny onto a geographic map (described here).

4. An update to the phytools function locate.yeti permitting the unknown leaf to attach below the root node of the base tree.

5. A bug fix for the S3 plotting method for objects of class "describe.simmap" (described here).

6. A bug fix for the phytools wrapper function plotBranchbyTrait that corrects the orientation of labels on fan trees, among other things.

7. A new version of threshBayes with an option to suppress printing the MCMC status to screen (which can be annoying for very long chains).

8. A bug fix in read.simmap to address a problem in which trees could not be read from a text string (only from file).

Finally, 9. drop.tip 'methods' for objects of class "contMap" and "densityMap". This also permits the user to supply internal node values to contMap.

I have posted a source build to the phytools page. Hopefully it is also on CRAN soon.

That's it for now.

Wednesday, August 20, 2014

Comparing models fit using GLS with different (nested) parameterizations of the error structure

Today on R-sig-phylo a reader asked:

I have a question conserning the pgls regression in package caper. The function allows to estimate or fix three branch length transformations. I wanna figure out which transformation gives me the best model fit by comparing for example a model with lambda estimated (lambda=“ML”) to a model where I fix lambda at 0.5 (lambda=0.5). However, if I use anova(model1, model2) to compare the two models I get the error message that the two models are run with different branch length transformations, which makes sense. But is there any other possbility to compare models with different branch length transformations? How do I know which model is better?

This is what I responded:

We can compare two fitted models, one in which lambda is estimated and another in which lambda is fixed (at 0, 1, or some other arbitrary value) using a likelihood ratio test. This is what that would look like:

[First, let's simulate some data:

library(phytools)
tree<-pbtree(n=50)
x<-fastBM(tree)
y<-0.75*x+fastBM(phytools:::lambdaTree(tree,0.5),sig2=0.2)
DATA<-data.frame(x=x,y=y)

Done.]

## fit models
## DATA is data frame containing x & y
## tree is object of class "phylo"
library(nlme)
fitBM<-gls(y~x,data=DATA,correlation=corBrownian(1,tree),
        method="ML")
fitBM
## Generalized least squares fit by maximum likelihood
##   Model: y ~ x 
##   Data: DATA 
##   Log-likelihood: -76.21
## 
## Coefficients:
## (Intercept)           x 
##      0.1209      0.7842 
## 
## Correlation Structure: corBrownian
##  Formula: ~1 
##  Parameter estimate(s):
## numeric(0)
## Degrees of freedom: 50 total; 48 residual
## Residual standard error: 2.159
fitLambda<-gls(y~x,data=DATA,correlation=corPagel(1,tree,fixed=FALSE),
    method="ML")
fitLambda
## Generalized least squares fit by maximum likelihood
##   Model: y ~ x 
##   Data: DATA 
##   Log-likelihood: -52.62
## 
## Coefficients:
## (Intercept)           x 
##     0.06996     0.80184 
## 
## Correlation Structure: corPagel
##  Formula: ~1 
##  Parameter estimate(s):
## lambda 
## 0.3873 
## Degrees of freedom: 50 total; 48 residual
## Residual standard error: 0.7515
## function for likelihood ratio test
lrtest<-function(model1,model2){
        lik1<-logLik(model1)
        lik2<-logLik(model2)
        LR<--2*(lik1-lik2)
        degf<-attr(lik2,"df")-attr(lik1,"df")
        P<-pchisq(LR,df=degf,lower.tail=FALSE)
        cat(paste("Likelihood ratio = ",
                signif(LR,5),"(df=",degf,") P = ",
                signif(P,4),"\n",sep=""))
        invisible(list(likelihood.ratio=LR,p=P))
}

## run likelihood-ratio test
lrtest(fitBM,fitLambda)
## Likelihood ratio = 47.176(df=1) P = 6.489e-12

I also noted that for small trees we may want to substitute a distribution obtained via simulation for the parametric χ2 distribtution; however I neglected to comment that only nested models can be compared in this way.

That's it.

Monday, August 18, 2014

drop.tip methods for contMap & densityMap now in phytools

The functions drop.tip.contMap and drop.tip.densityMap (described here for drop.tip.contMap) are now in phytools 0.4-30 which can be downloaded and installed from source.

Here's how they work:

## install & load phytools
install.packages("phytools_0.4-30.tar.gz",type="source")
## Installing package into 'C:/Users/Liam/Documents/R/win-library/3.1'
## (as 'lib' is unspecified)
## inferring 'repos = NULL' from the file name
library(phytools)
packageVersion("phytools")
## [1] '0.4.30'
## simulate tree & data
tree<-pbtree(n=26,tip.label=LETTERS[26:1],scale=1)
x<-fastBM(tree)
## create "contMap" obj
obj<-contMap(tree,x,plot=FALSE)
## drop tips
tips<-sample(LETTERS,10)
tips
##  [1] "L" "K" "N" "E" "G" "P" "J" "V" "M" "S"
obj<-drop.tip.contMap(obj,tips)
## plot
plot(obj)

plot of chunk unnamed-chunk-1

That's it.