I just posted a new version of read.newick
that is
substantially faster than prior versions. It is still way slower
than read.tree
in the ape package; however
read.newick
may be able to read some Newick strings
that are considered badly conformed by read.tree
.
First, to establish that the existing version of
read.newick
is quite slow:
library(phytools)
packageVersion("phytools")
## [1] '0.4.31'
## simulate a tree
tree<-pbtree(n=10000)
## write to a text string
text<-write.tree(tree)
## read into memory using read.newick
system.time(t1<-read.newick(text=text))
## user system elapsed
## 15.85 0.00 15.85
all.equal.phylo(tree,t1)
## [1] TRUE
## by comparison, read.tree
t2<-read.tree(text=text)
So, read.newick
is pretty damn slow - but it does work.
We can dig a little deeper to try & figure out precisely what
causes read.newick
to be so slow using Rprof
:
Rprof(tmp<-tempfile(),memory.profiling=TRUE)
t1<-read.newick(text=text)
Rprof()
summaryRprof(tmp,memory="both")
## $by.self
## self.time self.pct total.time total.pct mem.total
## "match" 10.98 64.74 11.12 65.57 844.0
## "rbind" 2.58 15.21 2.58 15.21 197.2
## "newick" 1.46 8.61 16.96 100.00 1311.5
## "getEdgeLength" 1.04 6.13 1.74 10.26 130.0
## "getLabel" 0.24 1.42 0.56 3.30 16.0
## "paste" 0.24 1.42 0.24 1.42 3.2
## "+" 0.10 0.59 0.10 0.59 9.4
## "c" 0.10 0.59 0.10 0.59 5.5
## "==" 0.06 0.35 0.06 0.35 0.9
## "as.numeric" 0.04 0.24 0.04 0.24 0.4
## "is.na" 0.04 0.24 0.04 0.24 7.9
## "<" 0.02 0.12 0.02 0.12 0.0
## ">" 0.02 0.12 0.02 0.12 3.2
## "strsplit" 0.02 0.12 0.02 0.12 0.0
## "vector" 0.02 0.12 0.02 0.12 3.2
##
## $by.total
## total.time total.pct mem.total self.time self.pct
## "newick" 16.96 100.00 1311.5 1.46 8.61
## "block_exec" 16.96 100.00 1311.5 0.00 0.00
## "call_block" 16.96 100.00 1311.5 0.00 0.00
## "doTryCatch" 16.96 100.00 1311.5 0.00 0.00
## "eval" 16.96 100.00 1311.5 0.00 0.00
## "evaluate" 16.96 100.00 1311.5 0.00 0.00
## "evaluate_call" 16.96 100.00 1311.5 0.00 0.00
## "handle" 16.96 100.00 1311.5 0.00 0.00
## "in_dir" 16.96 100.00 1311.5 0.00 0.00
## "knit" 16.96 100.00 1311.5 0.00 0.00
## "knit2html" 16.96 100.00 1311.5 0.00 0.00
## "process_file" 16.96 100.00 1311.5 0.00 0.00
## "process_group" 16.96 100.00 1311.5 0.00 0.00
## "process_group.block" 16.96 100.00 1311.5 0.00 0.00
## "read.newick" 16.96 100.00 1311.5 0.00 0.00
## "try" 16.96 100.00 1311.5 0.00 0.00
## "tryCatch" 16.96 100.00 1311.5 0.00 0.00
## "tryCatchList" 16.96 100.00 1311.5 0.00 0.00
## "tryCatchOne" 16.96 100.00 1311.5 0.00 0.00
## "withCallingHandlers" 16.96 100.00 1311.5 0.00 0.00
## "withVisible" 16.96 100.00 1311.5 0.00 0.00
## "match" 11.12 65.57 844.0 10.98 64.74
## "rbind" 2.58 15.21 197.2 2.58 15.21
## "getEdgeLength" 1.74 10.26 130.0 1.04 6.13
## "getLabel" 0.56 3.30 16.0 0.24 1.42
## "paste" 0.24 1.42 3.2 0.24 1.42
## "+" 0.10 0.59 9.4 0.10 0.59
## "c" 0.10 0.59 5.5 0.10 0.59
## "==" 0.06 0.35 0.9 0.06 0.35
## "as.numeric" 0.04 0.24 0.4 0.04 0.24
## "is.na" 0.04 0.24 7.9 0.04 0.24
## "<" 0.02 0.12 0.0 0.02 0.12
## ">" 0.02 0.12 3.2 0.02 0.12
## "strsplit" 0.02 0.12 0.0 0.02 0.12
## "vector" 0.02 0.12 3.2 0.02 0.12
## "unlist" 0.02 0.12 0.0 0.00 0.00
##
## $sample.interval
## [1] 0.02
##
## $sampling.time
## [1] 16.96
unlink(tmp)
This tells us that our main weak points are calls to match
and rbind
. match
is used multiple times
throughout read.newick
, but what is causing trouble is use
of match
to traverse the tree towards the root.
rbind
is being used to build our matrix edge
because we do not know how big it ought to be ahead of time.
So, what can we do? Well my solutions (posted here), which
seem fairly obvious in retrospect is to: (1) store the indices of each row
for each node as the tree is built; and (2) check how big edge
needs to be before we start. (I.e., how many edges there will be in our
tree.) This, it turns out, is just the sum of the number of "("
(or ")"
, these are the same) and ","
in our
original Newick string.
Let's try it:
source("read.newick.R")
Rprof(tmp<-tempfile(),memory.profiling=TRUE)
t3<-read.newick(text=text)
Rprof()
summaryRprof(tmp,memory="both")
## $by.self
## self.time self.pct total.time total.pct mem.total
## "newick" 1.58 51.97 3.04 100.00 299.7
## "match" 0.68 22.37 0.68 22.37 49.9
## "paste" 0.30 9.87 0.32 10.53 28.8
## "getEdgeLength" 0.24 7.89 1.12 36.84 100.1
## "is.na" 0.06 1.97 0.06 1.97 8.3
## "getLabel" 0.04 1.32 0.20 6.58 4.2
## "as.numeric" 0.04 1.32 0.04 1.32 4.5
## "!=" 0.02 0.66 0.02 0.66 0.0
## ":" 0.02 0.66 0.02 0.66 0.0
## "+" 0.02 0.66 0.02 0.66 0.0
## "list" 0.02 0.66 0.02 0.66 0.0
## "strsplit" 0.02 0.66 0.02 0.66 0.0
##
## $by.total
## total.time total.pct mem.total self.time self.pct
## "newick" 3.04 100.00 299.7 1.58 51.97
## "block_exec" 3.04 100.00 299.7 0.00 0.00
## "call_block" 3.04 100.00 299.7 0.00 0.00
## "doTryCatch" 3.04 100.00 299.7 0.00 0.00
## "eval" 3.04 100.00 299.7 0.00 0.00
## "evaluate" 3.04 100.00 299.7 0.00 0.00
## "evaluate_call" 3.04 100.00 299.7 0.00 0.00
## "handle" 3.04 100.00 299.7 0.00 0.00
## "in_dir" 3.04 100.00 299.7 0.00 0.00
## "knit" 3.04 100.00 299.7 0.00 0.00
## "knit2html" 3.04 100.00 299.7 0.00 0.00
## "process_file" 3.04 100.00 299.7 0.00 0.00
## "process_group" 3.04 100.00 299.7 0.00 0.00
## "process_group.block" 3.04 100.00 299.7 0.00 0.00
## "read.newick" 3.04 100.00 299.7 0.00 0.00
## "try" 3.04 100.00 299.7 0.00 0.00
## "tryCatch" 3.04 100.00 299.7 0.00 0.00
## "tryCatchList" 3.04 100.00 299.7 0.00 0.00
## "tryCatchOne" 3.04 100.00 299.7 0.00 0.00
## "withCallingHandlers" 3.04 100.00 299.7 0.00 0.00
## "withVisible" 3.04 100.00 299.7 0.00 0.00
## "getEdgeLength" 1.12 36.84 100.1 0.24 7.89
## "match" 0.68 22.37 49.9 0.68 22.37
## "paste" 0.32 10.53 28.8 0.30 9.87
## "getLabel" 0.20 6.58 4.2 0.04 1.32
## "is.na" 0.06 1.97 8.3 0.06 1.97
## "as.numeric" 0.04 1.32 4.5 0.04 1.32
## "!=" 0.02 0.66 0.0 0.02 0.66
## ":" 0.02 0.66 0.0 0.02 0.66
## "+" 0.02 0.66 0.0 0.02 0.66
## "list" 0.02 0.66 0.0 0.02 0.66
## "strsplit" 0.02 0.66 0.0 0.02 0.66
## "unlist" 0.02 0.66 0.0 0.00 0.00
##
## $sample.interval
## [1] 0.02
##
## $sampling.time
## [1] 3.04
unlink(tmp)
all.equal.phylo(tree,t3)
## [1] TRUE
That's it.
No comments:
Post a Comment
Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.