I just added a minor update to my ltt() function. The only significant difference between this function and plotLtt() in Rabosky's {laser} package, is that my function can also plot trees with non-contemporaneous (i.e., extinct) tips. The code for ltt() can be found on my R-phylogenetics page (direct link here).
Basically, in preparing for a talk that I will be giving tomorrow, I realized that my function relied on a "cladewise" order of the edges of the input "phylo" object. This is the order that is created by read.tree() (as well as my functions read.newick() and read.simmap()). However, some other function in R return trees with edges in "pruningwise" order, which is more efficient for post-order tree traversal and pruning. [For more information about this, type ?reorder.phylo at the prompt.]
Luckily, {ape} has a function that can switch from "pruningwise" to "cladewise" orders, or vice versa. So, the bug with ltt() was an easy fix. I simply added the line:
tree<-reorder.phylo(tree,order="cladewise")
to the code before beginning calculation.
[In addition to this change, I also moved "tol" - the tolerance threshold for rounding errors in node heights - to be an optional input instead of fixed. This probably won't affect most users.]
Wednesday, March 30, 2011
Friday, March 25, 2011
For fun: least squares phylogeny estimation
For my class this semester on "Applied Phylogenetic Methods" (which, based on a chat today with Ron Etter's postdoc Rob Jennings, I've now concluded should more properly be called "Phylogenetics Demystified"), I recently programmed an implementation of phylogeny estimation using the unweighted least squares method (on my R-phylogenetics page, code here [update, v0.2]) - originally developed by Cavalli-Sforza & Edwards (1967), and described in detail by Felsenstein (2004, which is also our textbook for the class). Felsenstein (2004) calls the least squares methods the statistically "best-justified" distance matrix methods (p. 148). To my knowledge there is no direct implementation of least squares in any R package [although it should be possible to call the fastME program from within R and minimum evolution uses the least squares branch lengths].
As the title of this post suggests, this implementation is really just for fun, and should not be considered a serious function for phylogeny inference! This is because it is really not programmed efficiently and is thus very, very slow for more than a modest number of taxa. It also uses only nearest neighbor interchange to search treespace (capitalizing on the nni() function of the very useful {phangorn} package) thus will tend to get stuck on "islands" of suboptimal trees (not connected to better trees by a single NNI) if the starting tree is not very good. Nonetheless, the function does appear to work. If we give a matrix containing the true evolutionary distances between species (i.e., the distances in the true tree), then the program will tend to find the correct topology and branch lengths given a reasonable starting tree.
The basic principle of least squares phylogeny estimation is straightforward. We would like to find a tree and set of branch lengths that minimizes the summed squared deviations between our observed distances (i.e., the set of distances that serve as input for our analysis) and the hypothesized distances (i.e., the patristic distances in our inferred tree).
So, in other words, our model is as follows: D = Xv+e; in which D is a vector of distances between species; X is a design matrix based on a hypothesized topology (to be discussed below); v is a vector of edge lengths; and e, of course, is our residual error.
Given a topology to solve for our estimated edge lengths we can just use our standard least squares estimating equation: v = (X'X)-1X'D. We can then calculate the sum of squares residual error (the quantity we want to minimize) for a given tree as: Q = (D-Xv)'(D-Xv).
So, there are two things we want to find. We want to find v that minimize Q conditioned a tree topology (given by X), which we can easily do using least squares estimation; but we also want to find the tree that minimizes Q given our least squares estimate of v.
We first need to compute X given a tree. I have not described the design matrix X to this point, but X is an N=n(n-1)/2 (for n species) by m=2n-3 (for a fully bifurcating unrooted tree) matrix. Here, you can probably see that N is just the number of distances between species; and m is the number of edges in the tree. We put a "1" in the Xij if the jth edge can be found on the path between the pair of species represented in the ith row of the matrix. To make this a little clearer, consider the unrooted, fully labeled tree at right and the design matrix calculated with my function phyloDesign() below:
> phyloDesign(tree)
6,7 7,1 7,2 6,8 8,3 8,4 6,5
1,2 0 1 1 0 0 0 0
1,3 1 1 0 1 1 0 0
1,4 1 1 0 1 0 1 0
1,5 1 1 0 0 0 0 1
2,3 1 0 1 1 1 0 0
2,4 1 0 1 1 0 1 0
2,5 1 0 1 0 0 0 1
3,4 0 0 0 0 1 1 0
3,5 0 0 0 1 1 0 1
4,5 0 0 0 1 0 1 1
Now, with this design matrix we can compute our least squares branches and Q, and it is then only a matter of finding the tree (and thus the design matrix, X) that minimizes Q. To do this, I have used nni() from the {phangorn} package to design a greedy (but nonetheless remarkably slow) algorithm for finding the least squares tree. nni() returns all the one-step away NNI trees for a given tree. So, I just returned all these, compute X, v, and Q for each tree, pick the lowest one, and repeat until Q no longer improves from NNIs. Obviously, this algorithm will easily get stuck on low-lying "hills" in treespace. I'm not familiar with other more aggressive tree rearrangement functions available in R (for instance, TBR), but perhaps my readers are and if so, they can tell me!
To see how this works, let's take the distance matrix for 8 mammals given by Felsenstein (2004; p. 163), as follows:
> D.mammals
dog bear raccoon weasel seal sea_lion cat monkey
dog 0 32 48 51 50 48 98 148
bear 32 0 26 34 29 33 84 136
raccoon 48 26 0 42 44 44 92 152
weasel 51 34 42 0 44 38 86 142
seal 50 29 44 44 0 24 89 142
sea_lion 48 33 44 38 24 0 90 142
cat 98 84 92 86 89 90 0 148
monkey 148 136 152 142 142 142 148 0
Now, let's load the function from source:
> source("optim.phylo.ls.R")
This will actually load four functions: optim.phylo.ls(), ls.tree(), phyloDesign(), and compute.ancestor.nodes(). It will also load the packages {phangorn} (and, of course, {ape}) on first execution.
> lstree<-optim.phylo.ls(D.mammals,upgma(D.mammals))
[1] "1 set(s) of nearest neighbor interchanges. best Q so far = 143.854166666667"
[1] "2 set(s) of nearest neighbor interchanges. best Q so far = 98.8333333333332"
[1] "3 set(s) of nearest neighbor interchanges. best Q so far = 92.1166666666666"
best Q score of 92.1166666666666 found after 3 nearest neighber interchange(s).
Using the UPGMA tree as our starting tree. Then we can plot our tree and edge lengths:
> plot(midpoint(lstree)) # use midpoint rooting
> edgelabels(round(midpoint(lstree)$edge.length,2), adj=c(0.5,-0.2),frame="none")
which results in a tree that makes a fair bit of sense. Note that the zero length branch in this tree was actually negative in the LS tree, but my program sets negative branch lengths to zero by default (although this can be changed by the user).
As the title of this post suggests, this implementation is really just for fun, and should not be considered a serious function for phylogeny inference! This is because it is really not programmed efficiently and is thus very, very slow for more than a modest number of taxa. It also uses only nearest neighbor interchange to search treespace (capitalizing on the nni() function of the very useful {phangorn} package) thus will tend to get stuck on "islands" of suboptimal trees (not connected to better trees by a single NNI) if the starting tree is not very good. Nonetheless, the function does appear to work. If we give a matrix containing the true evolutionary distances between species (i.e., the distances in the true tree), then the program will tend to find the correct topology and branch lengths given a reasonable starting tree.
The basic principle of least squares phylogeny estimation is straightforward. We would like to find a tree and set of branch lengths that minimizes the summed squared deviations between our observed distances (i.e., the set of distances that serve as input for our analysis) and the hypothesized distances (i.e., the patristic distances in our inferred tree).
So, in other words, our model is as follows: D = Xv+e; in which D is a vector of distances between species; X is a design matrix based on a hypothesized topology (to be discussed below); v is a vector of edge lengths; and e, of course, is our residual error.
Given a topology to solve for our estimated edge lengths we can just use our standard least squares estimating equation: v = (X'X)-1X'D. We can then calculate the sum of squares residual error (the quantity we want to minimize) for a given tree as: Q = (D-Xv)'(D-Xv).
So, there are two things we want to find. We want to find v that minimize Q conditioned a tree topology (given by X), which we can easily do using least squares estimation; but we also want to find the tree that minimizes Q given our least squares estimate of v.
We first need to compute X given a tree. I have not described the design matrix X to this point, but X is an N=n(n-1)/2 (for n species) by m=2n-3 (for a fully bifurcating unrooted tree) matrix. Here, you can probably see that N is just the number of distances between species; and m is the number of edges in the tree. We put a "1" in the Xij if the jth edge can be found on the path between the pair of species represented in the ith row of the matrix. To make this a little clearer, consider the unrooted, fully labeled tree at right and the design matrix calculated with my function phyloDesign() below:
> phyloDesign(tree)
6,7 7,1 7,2 6,8 8,3 8,4 6,5
1,2 0 1 1 0 0 0 0
1,3 1 1 0 1 1 0 0
1,4 1 1 0 1 0 1 0
1,5 1 1 0 0 0 0 1
2,3 1 0 1 1 1 0 0
2,4 1 0 1 1 0 1 0
2,5 1 0 1 0 0 0 1
3,4 0 0 0 0 1 1 0
3,5 0 0 0 1 1 0 1
4,5 0 0 0 1 0 1 1
Now, with this design matrix we can compute our least squares branches and Q, and it is then only a matter of finding the tree (and thus the design matrix, X) that minimizes Q. To do this, I have used nni() from the {phangorn} package to design a greedy (but nonetheless remarkably slow) algorithm for finding the least squares tree. nni() returns all the one-step away NNI trees for a given tree. So, I just returned all these, compute X, v, and Q for each tree, pick the lowest one, and repeat until Q no longer improves from NNIs. Obviously, this algorithm will easily get stuck on low-lying "hills" in treespace. I'm not familiar with other more aggressive tree rearrangement functions available in R (for instance, TBR), but perhaps my readers are and if so, they can tell me!
To see how this works, let's take the distance matrix for 8 mammals given by Felsenstein (2004; p. 163), as follows:
> D.mammals
dog bear raccoon weasel seal sea_lion cat monkey
dog 0 32 48 51 50 48 98 148
bear 32 0 26 34 29 33 84 136
raccoon 48 26 0 42 44 44 92 152
weasel 51 34 42 0 44 38 86 142
seal 50 29 44 44 0 24 89 142
sea_lion 48 33 44 38 24 0 90 142
cat 98 84 92 86 89 90 0 148
monkey 148 136 152 142 142 142 148 0
Now, let's load the function from source:
> source("optim.phylo.ls.R")
This will actually load four functions: optim.phylo.ls(), ls.tree(), phyloDesign(), and compute.ancestor.nodes(). It will also load the packages {phangorn} (and, of course, {ape}) on first execution.
> lstree<-optim.phylo.ls(D.mammals,upgma(D.mammals))
[1] "1 set(s) of nearest neighbor interchanges. best Q so far = 143.854166666667"
[1] "2 set(s) of nearest neighbor interchanges. best Q so far = 98.8333333333332"
[1] "3 set(s) of nearest neighbor interchanges. best Q so far = 92.1166666666666"
best Q score of 92.1166666666666 found after 3 nearest neighber interchange(s).
Using the UPGMA tree as our starting tree. Then we can plot our tree and edge lengths:
> plot(midpoint(lstree)) # use midpoint rooting
> edgelabels(round(midpoint(lstree)$edge.length,2), adj=c(0.5,-0.2),frame="none")
which results in a tree that makes a fair bit of sense. Note that the zero length branch in this tree was actually negative in the LS tree, but my program sets negative branch lengths to zero by default (although this can be changed by the user).
Tuesday, March 22, 2011
Postdoctoral position in my lab
I am hiring a postdoctoral researcher for my lab - to start on or after Aug. 1, 2011. I am particularly interested in applicants with experience studying evolutionary ecology of lizards; and/or computational methods for evolutionary biology (particularly phylogenetic methods). The initial appointment is for just one year, but we would ideally renew this appointment for multiple years (contingent on funding).
If interested, please check out the official HR job listing here. For more information about my lab, see my lab webpage. Finally, please feel free to email me to get more information about the position (or post questions of general interest to the blog).
If interested, please check out the official HR job listing here. For more information about my lab, see my lab webpage. Finally, please feel free to email me to get more information about the position (or post questions of general interest to the blog).
Monday, March 21, 2011
Order of traits in brownie.lite() and evol.vcv()
A couple of users (e.g., here) have identified an annoying property of both brownie.lite() and evol.vcv() [both available on my R-phylogenetics page] - that is, the functions order the estimated evolutionary rates or matrices by the order in which the corresponding mapped states appear in the tree (not, for instance, in numeric or alphabetical order). For instance, consider the following code (which also gives a method to simulate Brownian rate variation on the tree):
> # a Newick style SIMMAP tree
> text<-"((10:{te,0.4},4:{te,0.4}):{aq,1.00:te,0.28},((1:{aq,0.98},(((8:{aq,0.05},9:{aq,0.05}):{aq,0.03},7:{aq,0.08}):{te,0.06:aq,0.25},(5:{te,0.08},6:{te,0.08}):{te,0.3}):{te,0.59}):{te,0.04},(2:{aq,0.52},3:{aq,0.52}):{te,0.25:aq,0.25}):{aq,0.06:te,0.6});"
> tr1<-read.simmap(text=text) # read in tree
> simtree<-list(edge=tr1$edge,Nnode=tr1$Nnode, edge.length=tr1$mapped.edge[,"aq"]+20*tr1$mapped.edge[,"te"], tip.label=tr1$tip.label) # make tree for simulation
> attr(simtree,"class")<-"phylo"
> x<-fastBM(simtree) # simulate using fastBM() or sim.char()
> brownie.lite(tr1,x) # fit model
$sig2.single
[1] 8.57495
$var.single
[1] 14.70595
$logL1
[1] -22.13206
$k1
[1] 2
$sig2.multiple
aq te
0.9701962 22.1362945
$vcv.multiple
aq te
aq 0.6387571 -0.9954994
te -0.9954994 160.1397276
$logL.multiple
[1] -20.14824
$k2
[1] 3
$P.chisq
[1] 0.04638306
$convergence
[1] "Optimization has converged."
Note that is is very easy to get the mapped state at the root as follows:
> names(tr1$maps[[1]])[1]
[1] "aq"
because the first edge in the tree will always arise from the root node. The order of $sig2.multiple matches the order of states on the tree (i.e., in this case starting with "aq").
Now, let's flip the state at the root node by artificially rearranging the order of the mapped edges in our tree (altered portions in bold italic):
> text<-"((10:{te,0.4},4:{te,0.4}):{te,0.28:aq,1.00},((1:{aq,0.98},(((8:{aq,0.05},9:{aq,0.05}):{aq,0.03},7:{aq,0.08}):{te,0.06:aq,0.25},(5:{te,0.08},6:{te,0.08}):{te,0.3}):{te,0.59}):{te,0.04},(2:{aq,0.52},3:{aq,0.52}):{te,0.25:aq,0.25}):{te,0.6:aq,0.06});"
> tr2<-read.simmap(text=text)
> names(tr2$maps[[1]])[1]
[1] "te"
> brownie.lite(tr2,x)
...
$sig2.multiple
te aq
22.1362944 0.9701962
$vcv.multiple
te aq
te 160.1399035 -0.9954995
aq -0.9954995 0.6387571
$logL.multiple
[1] -20.14824
...
We can see that the estimated rates are the same (as we would expect), but their order in $sig2.multiple and $vcv.multiple has flipped.
This property (ordering the rates by their temporal appearance, rather than alphanumerically) was by design, not chance; however I can see how it might be annoying for someone analyzing a dataset containing many SIMMAP trees.
Fortunately, there is an easy fix. Just rearrange the columns of $mapped.edge in the order that you want the mapped states to be ordered. For instance:
> tr2$mapped.edge<-tr2$mapped.edge[,c("aq","te")]
> brownie.lite(tr2,x)
...
$sig2.multiple
aq te
0.9701962 22.1362945
$vcv.multiple
aq te
aq 0.6387571 -0.9954994
te -0.9954994 160.1397276
$logL.multiple
[1] -20.14824
...
and the problem is solved.
> # a Newick style SIMMAP tree
> text<-"((10:{te,0.4},4:{te,0.4}):{aq,1.00:te,0.28},((1:{aq,0.98},(((8:{aq,0.05},9:{aq,0.05}):{aq,0.03},7:{aq,0.08}):{te,0.06:aq,0.25},(5:{te,0.08},6:{te,0.08}):{te,0.3}):{te,0.59}):{te,0.04},(2:{aq,0.52},3:{aq,0.52}):{te,0.25:aq,0.25}):{aq,0.06:te,0.6});"
> tr1<-read.simmap(text=text) # read in tree
> simtree<-list(edge=tr1$edge,Nnode=tr1$Nnode, edge.length=tr1$mapped.edge[,"aq"]+20*tr1$mapped.edge[,"te"], tip.label=tr1$tip.label) # make tree for simulation
> attr(simtree,"class")<-"phylo"
> x<-fastBM(simtree) # simulate using fastBM() or sim.char()
> brownie.lite(tr1,x) # fit model
$sig2.single
[1] 8.57495
$var.single
[1] 14.70595
$logL1
[1] -22.13206
$k1
[1] 2
$sig2.multiple
aq te
0.9701962 22.1362945
$vcv.multiple
aq te
aq 0.6387571 -0.9954994
te -0.9954994 160.1397276
$logL.multiple
[1] -20.14824
$k2
[1] 3
$P.chisq
[1] 0.04638306
$convergence
[1] "Optimization has converged."
Note that is is very easy to get the mapped state at the root as follows:
> names(tr1$maps[[1]])[1]
[1] "aq"
because the first edge in the tree will always arise from the root node. The order of $sig2.multiple matches the order of states on the tree (i.e., in this case starting with "aq").
Now, let's flip the state at the root node by artificially rearranging the order of the mapped edges in our tree (altered portions in bold italic):
> text<-"((10:{te,0.4},4:{te,0.4}):{te,0.28:aq,1.00},((1:{aq,0.98},(((8:{aq,0.05},9:{aq,0.05}):{aq,0.03},7:{aq,0.08}):{te,0.06:aq,0.25},(5:{te,0.08},6:{te,0.08}):{te,0.3}):{te,0.59}):{te,0.04},(2:{aq,0.52},3:{aq,0.52}):{te,0.25:aq,0.25}):{te,0.6:aq,0.06});"
> tr2<-read.simmap(text=text)
> names(tr2$maps[[1]])[1]
[1] "te"
> brownie.lite(tr2,x)
...
$sig2.multiple
te aq
22.1362944 0.9701962
$vcv.multiple
te aq
te 160.1399035 -0.9954995
aq -0.9954995 0.6387571
$logL.multiple
[1] -20.14824
...
We can see that the estimated rates are the same (as we would expect), but their order in $sig2.multiple and $vcv.multiple has flipped.
This property (ordering the rates by their temporal appearance, rather than alphanumerically) was by design, not chance; however I can see how it might be annoying for someone analyzing a dataset containing many SIMMAP trees.
Fortunately, there is an easy fix. Just rearrange the columns of $mapped.edge in the order that you want the mapped states to be ordered. For instance:
> tr2$mapped.edge<-tr2$mapped.edge[,c("aq","te")]
> brownie.lite(tr2,x)
...
$sig2.multiple
aq te
0.9701962 22.1362945
$vcv.multiple
aq te
aq 0.6387571 -0.9954994
te -0.9954994 160.1397276
$logL.multiple
[1] -20.14824
...
and the problem is solved.
Thursday, March 17, 2011
Computing phylogenetic signal
I have posted a new function to compute phylogenetic signal for continuous traits using two methods: the lambda method of Pagel (1999) and Blomberg et al. (2003)'s K-statistic. Rich Glor already created a very helpful wiki page on how to do this in R using several different functions. My function, phylosig() (code here), should make this even easier. With phylosig() you can specify method (presently either "K" or "lambda") and optionally return a P-value for either the randomization test of Blomberg et al. or a likelihood ratio test against the null hypothesis that lambda=0.
For instance, after loading our function from source, to compute K and conduct a hypothesis test of the null hypothesis of no signal we just type:
> phylosig(tree,x,test=TRUE)
$K
[1] 1.053583
$P
[1] 0.001
By default, the function uses method="K" and nsim=1000.
Next, we can also estimate lambda using maximum likelihood:
> phylosig(tree,x,method="lambda",test=TRUE)
$lambda
[1] 0.9999339
$logL
[1] -132.7309
$P
[1] 0
Now, for comparison, let's simulate data with no phylogenetic signal:
> x<-rnorm(100); names(x)<-tree$tip.label
> phylosig(tree,x,test=TRUE)
$K
[1] 0.1020861
$P
[1] 0.426
> phylosig(tree,x,method="lambda",test=TRUE)
$lambda
[1] 6.610696e-05
$logL
[1] -138.0889
$P
[1] 1
Cool. Please check out the function (on my R-phylogenetics page) and let me know how it works.
For instance, after loading our function from source, to compute K and conduct a hypothesis test of the null hypothesis of no signal we just type:
> phylosig(tree,x,test=TRUE)
$K
[1] 1.053583
$P
[1] 0.001
By default, the function uses method="K" and nsim=1000.
Next, we can also estimate lambda using maximum likelihood:
> phylosig(tree,x,method="lambda",test=TRUE)
$lambda
[1] 0.9999339
$logL
[1] -132.7309
$P
[1] 0
Now, for comparison, let's simulate data with no phylogenetic signal:
> x<-rnorm(100); names(x)<-tree$tip.label
> phylosig(tree,x,test=TRUE)
$K
[1] 0.1020861
$P
[1] 0.426
> phylosig(tree,x,method="lambda",test=TRUE)
$lambda
[1] 6.610696e-05
$logL
[1] -138.0889
$P
[1] 1
Cool. Please check out the function (on my R-phylogenetics page) and let me know how it works.
Tuesday, March 15, 2011
Bug fixed in read.simmap()
Dave Collar helpfully identified a bug in v0.2 of read.simmap() that was preventing the function from properly reading in a dataset of multiple trees. [Since this function was a complete rewrite of v0.1, this was not a problem in the earlier edition.] This bug has now been fixed and the code (v0.3) is available on my R-phylogenetics page (direct link to code here).
Matrix representation parsimony (MRP) supertree estimation
Are any of my readers aware of an existing function to do Matrix Representation Parsimony (MRP) supertree estimation (Baum 1992; Ragan 1992; also see Sanderson et al. 1998; Bininda-Emonds et al. 2007) in R? I'm not, but I was thinking it would be a fun - and easy - thing to program. Consequently I have just posted a function for this, called mrp.supertree(), on my R-phylogenetics page (direct link to code here). This function could also be used to get the MRP consensus tree (that is, if all the input phylogenies have the same set of taxa; e.g., see Felsenstein 2004 pp. 527-528).
I programmed this by using two handy functions: the function prop.part() in {ape} (which returns all the non-trivial bipartitions for a given tree) and pratchet() in {phangorn} (Schliep 2010), which estimates the MP tree using the parsimony ratchet method of Nixon (1999).
The way MRP supertree estimation works is by first constructing N (for N trees) ni x mi binary matrices (for ni species and mi non-trivial bipartitions in the ith tree). We score the tree matrices by putting a 1 in the j,kth position if the jth taxon is found in the kth bipartition, for each bipartition, and a 0 otherwise. We then accumulate these matrices into a supermatrix which contains union(speciesi) rows and sum(mi) columns. We put a "?" into the supermatrix if a species is missing from the tree. Finally, our MRP supertree is the MP tree estimated for this binary supermatrix.
To see how well this function does, let's first generate a random reference tree with 50 species:
> true.tree<-rtree(n=50,br=NULL)
Now let's randomly prune 40-60% of the species from this tree ten times:
> trees<-list()
> class(trees)<-"multiPhylo"
> for(i in 1:10)
trees[[i]]<- drop.tip(true.tree,sample(true.tree$tip.label,
size=(20+round(runif(1)*10))))
Now let's load our MRP supertree code from source and run it:
> source("mrp.supertree.R")
> super.tree<-mrp.supertree(trees)
The MRP supertree, optimized via pratchet(),
has a parsimony score of 247 (minimum 247)
Now we can compare the estimated supertree to the true tree using dist.topo() in {ape}:
> dist.topo(true.tree,super.tree)
[1] 4
In this case, it is close. If we have high overlap this will generally be the case. For instance if we only drop 10-20 species from the each tree at random:
> for(i in 1:10)
trees[[i]]<- drop.tip(true.tree,sample(true.tree$tip.label,
size=(10+round(runif(1)*10))))
> super.tree<-mrp.supertree(trees)
The MRP supertree, optimized via pratchet(),
has a parsimony score of 315 (minimum 315)
> dist.topo(true.tree,super.tree)
[1] 0
whereas if we drop 30-40 at random:
> for(i in 1:10)
trees[[i]]<-drop.tip(true.tree,
sample(true.tree$tip.label,
size=(30+round(runif(1)*10))))
> super.tree<-mrp.supertree(trees)
The MRP supertree, optimized via pratchet(),
has a parsimony score of 208 (minimum 141)
> dist.topo(true.tree,super.tree)
[1] 82
A few notes on this implementation. Firstly, I set pratchet() to return all equally parsimonious trees (if several are found, mrp.supertree() will return a "multiPhylo" object) - but this is not the same as a guarantee to find all most parsimonious trees. This could only be guaranteed with an exhaustive search. Secondly, pratchet() does not always find the MP tree - so you might run mrp.supertree() multiple times if the parsimony score is greater than the theoretically minimal score.
Feedback welcome.
I programmed this by using two handy functions: the function prop.part() in {ape} (which returns all the non-trivial bipartitions for a given tree) and pratchet() in {phangorn} (Schliep 2010), which estimates the MP tree using the parsimony ratchet method of Nixon (1999).
The way MRP supertree estimation works is by first constructing N (for N trees) ni x mi binary matrices (for ni species and mi non-trivial bipartitions in the ith tree). We score the tree matrices by putting a 1 in the j,kth position if the jth taxon is found in the kth bipartition, for each bipartition, and a 0 otherwise. We then accumulate these matrices into a supermatrix which contains union(speciesi) rows and sum(mi) columns. We put a "?" into the supermatrix if a species is missing from the tree. Finally, our MRP supertree is the MP tree estimated for this binary supermatrix.
To see how well this function does, let's first generate a random reference tree with 50 species:
> true.tree<-rtree(n=50,br=NULL)
Now let's randomly prune 40-60% of the species from this tree ten times:
> trees<-list()
> class(trees)<-"multiPhylo"
> for(i in 1:10)
trees[[i]]<- drop.tip(true.tree,sample(true.tree$tip.label,
size=(20+round(runif(1)*10))))
Now let's load our MRP supertree code from source and run it:
> source("mrp.supertree.R")
> super.tree<-mrp.supertree(trees)
The MRP supertree, optimized via pratchet(),
has a parsimony score of 247 (minimum 247)
Now we can compare the estimated supertree to the true tree using dist.topo() in {ape}:
> dist.topo(true.tree,super.tree)
[1] 4
In this case, it is close. If we have high overlap this will generally be the case. For instance if we only drop 10-20 species from the each tree at random:
> for(i in 1:10)
trees[[i]]<- drop.tip(true.tree,sample(true.tree$tip.label,
size=(10+round(runif(1)*10))))
> super.tree<-mrp.supertree(trees)
The MRP supertree, optimized via pratchet(),
has a parsimony score of 315 (minimum 315)
> dist.topo(true.tree,super.tree)
[1] 0
whereas if we drop 30-40 at random:
> for(i in 1:10)
trees[[i]]<-drop.tip(true.tree,
sample(true.tree$tip.label,
size=(30+round(runif(1)*10))))
> super.tree<-mrp.supertree(trees)
The MRP supertree, optimized via pratchet(),
has a parsimony score of 208 (minimum 141)
> dist.topo(true.tree,super.tree)
[1] 82
A few notes on this implementation. Firstly, I set pratchet() to return all equally parsimonious trees (if several are found, mrp.supertree() will return a "multiPhylo" object) - but this is not the same as a guarantee to find all most parsimonious trees. This could only be guaranteed with an exhaustive search. Secondly, pratchet() does not always find the MP tree - so you might run mrp.supertree() multiple times if the parsimony score is greater than the theoretically minimal score.
Feedback welcome.
Monday, March 14, 2011
New version of fastBM() for multiple simulations
In a previous post I described a fast Brownian simulation function, fastBM(), and I showed that it was indeed quite fast. I have since added some additional options to this function - for instance, simulating with a trend or simulating within hard boundaries.
One teeny-tiny problem though. Although this function is much faster than sim.char() in {geiger} for a single simulation, e.g.:
> tree<-rtree(1000)
> system.time(x<-sim.char(phy=tree,as.matrix(1)))
user system elapsed
80.87 0.00 80.99
compared to:
> system.time(x<-fastBM(tree))
user system elapsed
0.18 0.00 0.19
[using v0.2 of fastBM() on an Intel Core i5 @ 3.20GHz], it can actually be much slower for multiple simulations. For instance, sim.char() takes basically the same time to run 100 simulations as it does to run 1:
> system.time(X<-sim.char(phy=tree,as.matrix(1),nsim=100))
user system elapsed
80.76 0.02 80.87
whereas the only way to run multiple simulations on the same tree using fastBM() has been to loop the function many times, e.g.
> X<-matrix(NA,length(tree$tip.label),100,
dimnames=list(tree$tip.label))
> system.time(X<-apply(X,2,function(x) x<-fastBM(tree)))
user system elapsed
19.74 0.00 19.74
Although fastBM() still does better in this example, it has gone from being 400 times faster - to a mere 4 times faster! That is because when we just loop calls to fastBM(), computation time increases in direct proportion to the number of simulations (in this case ~0.2s x 100 = 20s). By contrast, sim.char() only has one very hard calculation to do - the eigen-decomposition of the matrix given by vcv(tree). Once this calculation is complete, the function needs very little additional time to run more simulations.
It is not difficult to see where this is going. For any 400 simulations sim.char() and fastBM() take the same amount of time, and for any more than that sim.char() becomes faster.
Today, I decided to add the capacity to run multiple simulations - without looping over the tree many times - into fastBM(). The new version (tagged v0.4) is available on my R-phylogenetics page (direct link to code here). This was really pretty easy to do. The only tricky parts involved evaluating boundary violations across all the layers of the simulated data array, and then reflecting if node values were outside the bound. To do this, I first created an internal function reflect():
reflect<-function(yy,bounds){
while(yy < bounds[1]||yy > bounds[2]){
if(yy < bounds[1]) yy<-2*bounds[1]-yy
if(yy > bounds[2]) yy<-2*bounds[2]-yy
}
return(yy)
}
and then I called this function using apply(), i.e.:
y[i,2,]<-apply(as.matrix(y[i,2,]),1,function(yy)
reflect(yy,bounds))
I also created a logical variable, no.bounds, to avoid calling the internal function reflect() if the bounds are trivial (in other words, for bounds=c(-Inf,Inf), the default). I.e.:
if(!no.bounds) y[i,2,]<-apply(as.matrix(y[i,2,]),1,function(yy)
reflect(yy,bounds))
With these problems solved, we can now evaluate the performance of our new version of fastBM():
> system.time(X<-fastBM(tree,nsim=100))
user system elapsed
0.28 0.00 0.28
The instinct to limit our calls to reflect() was indeed a good idea, because simulating with bounds is quite a bit slower than this, e.g.:
> system.time(X<-fastBM(tree,nsim=100,bounds=c(-2,4)))
user system elapsed
2.91 0.00 2.92
Neat.
Since I wrote this code this afternoon, please let me know if you are able to get it to run (either post a comment - or email me directly).
One teeny-tiny problem though. Although this function is much faster than sim.char() in {geiger} for a single simulation, e.g.:
> tree<-rtree(1000)
> system.time(x<-sim.char(phy=tree,as.matrix(1)))
user system elapsed
80.87 0.00 80.99
compared to:
> system.time(x<-fastBM(tree))
user system elapsed
0.18 0.00 0.19
[using v0.2 of fastBM() on an Intel Core i5 @ 3.20GHz], it can actually be much slower for multiple simulations. For instance, sim.char() takes basically the same time to run 100 simulations as it does to run 1:
> system.time(X<-sim.char(phy=tree,as.matrix(1),nsim=100))
user system elapsed
80.76 0.02 80.87
whereas the only way to run multiple simulations on the same tree using fastBM() has been to loop the function many times, e.g.
> X<-matrix(NA,length(tree$tip.label),100,
dimnames=list(tree$tip.label))
> system.time(X<-apply(X,2,function(x) x<-fastBM(tree)))
user system elapsed
19.74 0.00 19.74
Although fastBM() still does better in this example, it has gone from being 400 times faster - to a mere 4 times faster! That is because when we just loop calls to fastBM(), computation time increases in direct proportion to the number of simulations (in this case ~0.2s x 100 = 20s). By contrast, sim.char() only has one very hard calculation to do - the eigen-decomposition of the matrix given by vcv(tree). Once this calculation is complete, the function needs very little additional time to run more simulations.
It is not difficult to see where this is going. For any 400 simulations sim.char() and fastBM() take the same amount of time, and for any more than that sim.char() becomes faster.
Today, I decided to add the capacity to run multiple simulations - without looping over the tree many times - into fastBM(). The new version (tagged v0.4) is available on my R-phylogenetics page (direct link to code here). This was really pretty easy to do. The only tricky parts involved evaluating boundary violations across all the layers of the simulated data array, and then reflecting if node values were outside the bound. To do this, I first created an internal function reflect():
reflect<-function(yy,bounds){
while(yy < bounds[1]||yy > bounds[2]){
if(yy < bounds[1]) yy<-2*bounds[1]-yy
if(yy > bounds[2]) yy<-2*bounds[2]-yy
}
return(yy)
}
and then I called this function using apply(), i.e.:
y[i,2,]<-apply(as.matrix(y[i,2,]),1,function(yy)
reflect(yy,bounds))
I also created a logical variable, no.bounds, to avoid calling the internal function reflect() if the bounds are trivial (in other words, for bounds=c(-Inf,Inf), the default). I.e.:
if(!no.bounds) y[i,2,]<-apply(as.matrix(y[i,2,]),1,function(yy)
reflect(yy,bounds))
With these problems solved, we can now evaluate the performance of our new version of fastBM():
> system.time(X<-fastBM(tree,nsim=100))
user system elapsed
0.28 0.00 0.28
The instinct to limit our calls to reflect() was indeed a good idea, because simulating with bounds is quite a bit slower than this, e.g.:
> system.time(X<-fastBM(tree,nsim=100,bounds=c(-2,4)))
user system elapsed
2.91 0.00 2.92
Neat.
Since I wrote this code this afternoon, please let me know if you are able to get it to run (either post a comment - or email me directly).
Saturday, March 12, 2011
Walter Fitch (1929-2011)
I'm sad to report on this blog the news that Walter Fitch, pioneer in the field of molecular phylogenetics, long time professor and former chair of biology at UC Irvine, and member of the National Academy of Sciences, passed away yesterday morning. I learned of his passing today via the Evoldir email listserve. The original email message sent from the current chair of biology at UC Irvine can be read here. To learn more about the contributions of Fitch to molecular phylogenetics refer to one of many references online (e.g., 1, 2) or, especially, read Joe Felsenstein's book "Inferring Phylogenies." Among Walter Fitch's many lasting contributions to phylogenetic systematics is a fast algorithm for computing the parsimony score of a site pattern on a tree (Fitch 1971) - now known as the "Fitch parsimony algorithm."
Tuesday, March 8, 2011
Significant updates: phylomorphospace()
I just made a few significant updates to my function to create the phylomorphospace() plots of Sidlauskas (2008). This new version can be downloaded from my R phylogenetics page (direct link to code here).
First, I removed the requirement that the user provide a matrix of the estimated states at internal nodes of the tree. Now, if this is left out the function will simply use the {ape} function ace() to estimate these values using maximum likelihood.
Second, I liked the appearance of Luke's ecomorph plot a lot, so I decided to make a slightly modified version of this design the default for the function. This means that internal nodes are small filled circles, and terminal nodes are larger filled circles (although I did not use the same size disparity as in Luke's plot - see below). In addition, plotting tip labels is now optional and can be turned off by setting label=FALSE.
Finally, third, I added some basic error checking to make the function less buggy.
Feedback welcome!
[Click on image for a higher quality version.]
First, I removed the requirement that the user provide a matrix of the estimated states at internal nodes of the tree. Now, if this is left out the function will simply use the {ape} function ace() to estimate these values using maximum likelihood.
Second, I liked the appearance of Luke's ecomorph plot a lot, so I decided to make a slightly modified version of this design the default for the function. This means that internal nodes are small filled circles, and terminal nodes are larger filled circles (although I did not use the same size disparity as in Luke's plot - see below). In addition, plotting tip labels is now optional and can be turned off by setting label=FALSE.
Finally, third, I added some basic error checking to make the function less buggy.
Feedback welcome!
[Click on image for a higher quality version.]
Monday, March 7, 2011
brownie.lite() tutorial
Sam Price has created a wiki tutorial for my brownie.lite() function that conducts the rate variation analysis described in O'Meara et al. (2006) and implemented in Brian's C++ program Brownie; and for my function to read stochastic character map formatted trees, read.simmap(). The tutorial is far more detailed than anything that I could have prepared and comes with an empirical example to test. The link to the tutorial is here. The R functions brownie.lite() and read.simmap() are available from my R phylogenetics page (direct links to code are here and here, respectively).
Creating a star phylogeny
I just posted up a function to create a star phylogeny (that is, a completely unresolved multifurcating tree) on my R phylogenetics page (code here). Very simple, and it seems likely that this can already be done with an existing R function that I am not aware of.
Basically, recalling back to the structure of a "phylo" object in memory, we just need to create an edge matrix containing n+1 (for n species) in the left column and 1:n in the right column:
edge<-matrix(NA,n,2)
edge[,1]<-n+1; edge[,2]<-1:n
Then we can optionally assign branch lengths, if they have been provided by the user:
if(!is.null(branch.lengths)) edge.length=branch.lengths
Finally we combine these elements into a list and assign the class "phylo":
phy<-list(edge=edge,edge.length=edge.length,Nnode=1,
tip.label=as.vector(species))
class(phy)<-"phylo"
and we're done.
Basically, recalling back to the structure of a "phylo" object in memory, we just need to create an edge matrix containing n+1 (for n species) in the left column and 1:n in the right column:
edge<-matrix(NA,n,2)
edge[,1]<-n+1; edge[,2]<-1:n
Then we can optionally assign branch lengths, if they have been provided by the user:
if(!is.null(branch.lengths)) edge.length=branch.lengths
Finally we combine these elements into a list and assign the class "phylo":
phy<-list(edge=edge,edge.length=edge.length,Nnode=1,
tip.label=as.vector(species))
class(phy)<-"phylo"
and we're done.
Sunday, March 6, 2011
Counting descendant leaves
David Bapst was today interested in computing the number of tips (leaves) descended from each internal node in the tree. These can be obtained (for each node above the root) for the {ape} "phylo" object phy using the following command:
> require(geiger) # load the geiger package
> desc<-sapply(tree$edge[,2],function(x) node.leaves(tree,x))
> Ndesc<-sapply(desc,function(x) length(x))
I thought it might be faster to count the number of descendants as we read our tree into memory. This is possible because by the time our Newick tree parser reaches a ")" in the string, all of the descendant edges and tips already exist in memory. If we are diligent about assigning counts to all internal nodes, to get the count for the current node all we have to do is add the counts for its two (or more) daughters.
To allow David to see if this is indeed faster, I have added this to my existing (and otherwise not particularly useful) function read.newick(). Now this function creates a "phylo" object with the additional vector $Ndesc containing the number of tips descended from each node above the root. The order of $Ndesc is the row order of $edge[,2]. I have posted this to my R phylogenetics page (direct link to code here).
> require(geiger) # load the geiger package
> desc<-sapply(tree$edge[,2],function(x) node.leaves(tree,x))
> Ndesc<-sapply(desc,function(x) length(x))
I thought it might be faster to count the number of descendants as we read our tree into memory. This is possible because by the time our Newick tree parser reaches a ")" in the string, all of the descendant edges and tips already exist in memory. If we are diligent about assigning counts to all internal nodes, to get the count for the current node all we have to do is add the counts for its two (or more) daughters.
To allow David to see if this is indeed faster, I have added this to my existing (and otherwise not particularly useful) function read.newick(). Now this function creates a "phylo" object with the additional vector $Ndesc containing the number of tips descended from each node above the root. The order of $Ndesc is the row order of $edge[,2]. I have posted this to my R phylogenetics page (direct link to code here).
Saturday, March 5, 2011
BM simulation with bounds
Inspired by another R-sig-phylo listserve discussion, I just added the capacity to conduct bounded Brownian simulation to my fastBM() function. This can be found on my R-phylogenetics page (direct link to code here).
Doing this was easy. I just added a simple while() loop which runs after a descendant node phenotype is calculated. The while() condition evaluates whether the phenotype is on the interval given by bounds, and if not, then it returns the phenotype to within the bounds by bouncing it back by the amount that it would have otherwise exceeded the boundary.
while(y[i,2] < bounds[1] || y[i,2] > bounds[2]){
if(y[i,2]< bounds[1]) y[i,2] <- 2*bounds[1]-y[i,2]
if(y[i,2] > bounds[2]) y[i,2] <- 2*bounds[2]-y[i,2]
}
Seems to work . . . . Feedback welcome!
Doing this was easy. I just added a simple while() loop which runs after a descendant node phenotype is calculated. The while() condition evaluates whether the phenotype is on the interval given by bounds, and if not, then it returns the phenotype to within the bounds by bouncing it back by the amount that it would have otherwise exceeded the boundary.
while(y[i,2] < bounds[1] || y[i,2] > bounds[2]){
if(y[i,2]< bounds[1]) y[i,2] <- 2*bounds[1]-y[i,2]
if(y[i,2] > bounds[2]) y[i,2] <- 2*bounds[2]-y[i,2]
}
Seems to work . . . . Feedback welcome!
Friday, March 4, 2011
Prune tree to a list of taxa
Using the {ape} function drop.tip() we can easily excise a single taxon or a list of taxa from our "phylo" tree object in R. However, it is not immediately obvious how to prune the tree to include, rather than exclude, a specific list of tips. Trina Roberts (now at NESCent) shared a trick to do this with me some time ago, and I thought I'd pass it along to the readers of this blog.
First, let's start with a tree of 10 species:
> tree<-rtree(10)
> write.tree(tree)
[1] "(t8:0.22,((((t3:0.9,(t7:0.48,t2:0.5):0.12):0.47,t6:0.55):0.08,(t5:0.49,(t9:0.71,t10:0.13):0.15):0.7):0.87,(t1:0.72,t4:0.62):0.55):0.47);"
Now, say we want to keep the species t2, t4, t6, t8, and t10 in our pruned tree, we just put these tip names into a vector:
> species<-c("t2","t4","t6","t8","t10")
[More commonly, this vector will probably come from the row names in our data matrix, or we might read it from a text file.]
We create the pruned tree with one command:
> pruned.tree<-drop.tip(tree,tree$tip.label[-match(species, tree$tip.label)])
Now we have our pruned tree, as desired:
> write.tree(pruned.tree)
[1] "(t8:0.22,(((t2:1.09,t6:0.55):0.08,t10:0.98):0.87,t4:1.17):0.47);"
First, let's start with a tree of 10 species:
> tree<-rtree(10)
> write.tree(tree)
[1] "(t8:0.22,((((t3:0.9,(t7:0.48,t2:0.5):0.12):0.47,t6:0.55):0.08,(t5:0.49,(t9:0.71,t10:0.13):0.15):0.7):0.87,(t1:0.72,t4:0.62):0.55):0.47);"
Now, say we want to keep the species t2, t4, t6, t8, and t10 in our pruned tree, we just put these tip names into a vector:
> species<-c("t2","t4","t6","t8","t10")
[More commonly, this vector will probably come from the row names in our data matrix, or we might read it from a text file.]
We create the pruned tree with one command:
> pruned.tree<-drop.tip(tree,tree$tip.label[-match(species, tree$tip.label)])
Now we have our pruned tree, as desired:
> write.tree(pruned.tree)
[1] "(t8:0.22,(((t2:1.09,t6:0.55):0.08,t10:0.98):0.87,t4:1.17):0.47);"
Subscribe to:
Posts (Atom)