Thursday, August 28, 2014

Faster version of read.newick

I just posted a new version of read.newick that is substantially faster than prior versions. It is still way slower than read.tree in the ape package; however read.newick may be able to read some Newick strings that are considered badly conformed by read.tree.

First, to establish that the existing version of read.newick is quite slow:

library(phytools)
packageVersion("phytools")
## [1] '0.4.31'
## simulate a tree
tree<-pbtree(n=10000)
## write to a text string
text<-write.tree(tree)
## read into memory using read.newick
system.time(t1<-read.newick(text=text))
##    user  system elapsed 
##   15.85    0.00   15.85
all.equal.phylo(tree,t1)
## [1] TRUE
## by comparison, read.tree
t2<-read.tree(text=text)

So, read.newick is pretty damn slow - but it does work.

We can dig a little deeper to try & figure out precisely what causes read.newick to be so slow using Rprof:

Rprof(tmp<-tempfile(),memory.profiling=TRUE)
t1<-read.newick(text=text)
Rprof()
summaryRprof(tmp,memory="both")
## $by.self
##                 self.time self.pct total.time total.pct mem.total
## "match"             10.98    64.74      11.12     65.57     844.0
## "rbind"              2.58    15.21       2.58     15.21     197.2
## "newick"             1.46     8.61      16.96    100.00    1311.5
## "getEdgeLength"      1.04     6.13       1.74     10.26     130.0
## "getLabel"           0.24     1.42       0.56      3.30      16.0
## "paste"              0.24     1.42       0.24      1.42       3.2
## "+"                  0.10     0.59       0.10      0.59       9.4
## "c"                  0.10     0.59       0.10      0.59       5.5
## "=="                 0.06     0.35       0.06      0.35       0.9
## "as.numeric"         0.04     0.24       0.04      0.24       0.4
## "is.na"              0.04     0.24       0.04      0.24       7.9
## "<"                  0.02     0.12       0.02      0.12       0.0
## ">"                  0.02     0.12       0.02      0.12       3.2
## "strsplit"           0.02     0.12       0.02      0.12       0.0
## "vector"             0.02     0.12       0.02      0.12       3.2
## 
## $by.total
##                       total.time total.pct mem.total self.time self.pct
## "newick"                   16.96    100.00    1311.5      1.46     8.61
## "block_exec"               16.96    100.00    1311.5      0.00     0.00
## "call_block"               16.96    100.00    1311.5      0.00     0.00
## "doTryCatch"               16.96    100.00    1311.5      0.00     0.00
## "eval"                     16.96    100.00    1311.5      0.00     0.00
## "evaluate"                 16.96    100.00    1311.5      0.00     0.00
## "evaluate_call"            16.96    100.00    1311.5      0.00     0.00
## "handle"                   16.96    100.00    1311.5      0.00     0.00
## "in_dir"                   16.96    100.00    1311.5      0.00     0.00
## "knit"                     16.96    100.00    1311.5      0.00     0.00
## "knit2html"                16.96    100.00    1311.5      0.00     0.00
## "process_file"             16.96    100.00    1311.5      0.00     0.00
## "process_group"            16.96    100.00    1311.5      0.00     0.00
## "process_group.block"      16.96    100.00    1311.5      0.00     0.00
## "read.newick"              16.96    100.00    1311.5      0.00     0.00
## "try"                      16.96    100.00    1311.5      0.00     0.00
## "tryCatch"                 16.96    100.00    1311.5      0.00     0.00
## "tryCatchList"             16.96    100.00    1311.5      0.00     0.00
## "tryCatchOne"              16.96    100.00    1311.5      0.00     0.00
## "withCallingHandlers"      16.96    100.00    1311.5      0.00     0.00
## "withVisible"              16.96    100.00    1311.5      0.00     0.00
## "match"                    11.12     65.57     844.0     10.98    64.74
## "rbind"                     2.58     15.21     197.2      2.58    15.21
## "getEdgeLength"             1.74     10.26     130.0      1.04     6.13
## "getLabel"                  0.56      3.30      16.0      0.24     1.42
## "paste"                     0.24      1.42       3.2      0.24     1.42
## "+"                         0.10      0.59       9.4      0.10     0.59
## "c"                         0.10      0.59       5.5      0.10     0.59
## "=="                        0.06      0.35       0.9      0.06     0.35
## "as.numeric"                0.04      0.24       0.4      0.04     0.24
## "is.na"                     0.04      0.24       7.9      0.04     0.24
## "<"                         0.02      0.12       0.0      0.02     0.12
## ">"                         0.02      0.12       3.2      0.02     0.12
## "strsplit"                  0.02      0.12       0.0      0.02     0.12
## "vector"                    0.02      0.12       3.2      0.02     0.12
## "unlist"                    0.02      0.12       0.0      0.00     0.00
## 
## $sample.interval
## [1] 0.02
## 
## $sampling.time
## [1] 16.96
unlink(tmp)

This tells us that our main weak points are calls to match and rbind. match is used multiple times throughout read.newick, but what is causing trouble is use of match to traverse the tree towards the root. rbind is being used to build our matrix edge because we do not know how big it ought to be ahead of time.

So, what can we do? Well my solutions (posted here), which seem fairly obvious in retrospect is to: (1) store the indices of each row for each node as the tree is built; and (2) check how big edge needs to be before we start. (I.e., how many edges there will be in our tree.) This, it turns out, is just the sum of the number of "(" (or ")", these are the same) and "," in our original Newick string.

Let's try it:

source("read.newick.R")
Rprof(tmp<-tempfile(),memory.profiling=TRUE)
t3<-read.newick(text=text)
Rprof()
summaryRprof(tmp,memory="both")
## $by.self
##                 self.time self.pct total.time total.pct mem.total
## "newick"             1.58    51.97       3.04    100.00     299.7
## "match"              0.68    22.37       0.68     22.37      49.9
## "paste"              0.30     9.87       0.32     10.53      28.8
## "getEdgeLength"      0.24     7.89       1.12     36.84     100.1
## "is.na"              0.06     1.97       0.06      1.97       8.3
## "getLabel"           0.04     1.32       0.20      6.58       4.2
## "as.numeric"         0.04     1.32       0.04      1.32       4.5
## "!="                 0.02     0.66       0.02      0.66       0.0
## ":"                  0.02     0.66       0.02      0.66       0.0
## "+"                  0.02     0.66       0.02      0.66       0.0
## "list"               0.02     0.66       0.02      0.66       0.0
## "strsplit"           0.02     0.66       0.02      0.66       0.0
## 
## $by.total
##                       total.time total.pct mem.total self.time self.pct
## "newick"                    3.04    100.00     299.7      1.58    51.97
## "block_exec"                3.04    100.00     299.7      0.00     0.00
## "call_block"                3.04    100.00     299.7      0.00     0.00
## "doTryCatch"                3.04    100.00     299.7      0.00     0.00
## "eval"                      3.04    100.00     299.7      0.00     0.00
## "evaluate"                  3.04    100.00     299.7      0.00     0.00
## "evaluate_call"             3.04    100.00     299.7      0.00     0.00
## "handle"                    3.04    100.00     299.7      0.00     0.00
## "in_dir"                    3.04    100.00     299.7      0.00     0.00
## "knit"                      3.04    100.00     299.7      0.00     0.00
## "knit2html"                 3.04    100.00     299.7      0.00     0.00
## "process_file"              3.04    100.00     299.7      0.00     0.00
## "process_group"             3.04    100.00     299.7      0.00     0.00
## "process_group.block"       3.04    100.00     299.7      0.00     0.00
## "read.newick"               3.04    100.00     299.7      0.00     0.00
## "try"                       3.04    100.00     299.7      0.00     0.00
## "tryCatch"                  3.04    100.00     299.7      0.00     0.00
## "tryCatchList"              3.04    100.00     299.7      0.00     0.00
## "tryCatchOne"               3.04    100.00     299.7      0.00     0.00
## "withCallingHandlers"       3.04    100.00     299.7      0.00     0.00
## "withVisible"               3.04    100.00     299.7      0.00     0.00
## "getEdgeLength"             1.12     36.84     100.1      0.24     7.89
## "match"                     0.68     22.37      49.9      0.68    22.37
## "paste"                     0.32     10.53      28.8      0.30     9.87
## "getLabel"                  0.20      6.58       4.2      0.04     1.32
## "is.na"                     0.06      1.97       8.3      0.06     1.97
## "as.numeric"                0.04      1.32       4.5      0.04     1.32
## "!="                        0.02      0.66       0.0      0.02     0.66
## ":"                         0.02      0.66       0.0      0.02     0.66
## "+"                         0.02      0.66       0.0      0.02     0.66
## "list"                      0.02      0.66       0.0      0.02     0.66
## "strsplit"                  0.02      0.66       0.0      0.02     0.66
## "unlist"                    0.02      0.66       0.0      0.00     0.00
## 
## $sample.interval
## [1] 0.02
## 
## $sampling.time
## [1] 3.04
unlink(tmp)
all.equal.phylo(tree,t3)
## [1] TRUE

That's it.

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.