Comments on Phylogenetic Tools for Comparative Biology: For fun: least squares phylogeny estimation

If I understand correctly, for a given tree topolo...

2011-04-02T02:25:22.119-04:00

If I understand correctly, for a given tree topology the branch lengths are estimated by least squares in both the LS and ME methods. The choice among topologies is then made by the LS criterion in the first case, and by minimizing the sum of the branch lengths in the second. That might choose the same topology, but it is not guaranteed to do so. If it does arrive at the same topology as LS, then of course the branch lengths are the same. Note that in the ME case you do want to constrain branch lengths not to be negative.
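
To make the distinction concrete, here is a minimal sketch in base R for four taxa. The distances are hypothetical and hand-picked (deliberately not tree-like), and the helper names quartet_design and score are made up for this illustration. For each of the three unrooted topologies it fits unconstrained least-squares branch lengths, then reports the LS criterion (residual sum of squares, Q) and the ME criterion (sum of branch lengths).

```r
# Toy example: hypothetical distances chosen by hand, not real data.
taxa <- c("A", "B", "C", "D")
d <- c(AB = 2, AC = 3, AD = 5, BC = 5, BD = 3, CD = 3)

# Design matrix for the quartet whose internal edge separates `side`
# from the other two taxa. Columns: four terminal edges + internal edge.
quartet_design <- function(side) {
  pr <- t(combn(taxa, 2))  # rows: AB, AC, AD, BC, BD, CD
  X <- matrix(0, nrow(pr), 5,
              dimnames = list(paste0(pr[, 1], pr[, 2]), c(taxa, "internal")))
  for (r in seq_len(nrow(pr))) {
    X[r, pr[r, ]] <- 1  # both terminal edges lie on the path
    # the internal edge is on the path only if the pair spans the split
    X[r, "internal"] <- as.numeric(xor(pr[r, 1] %in% side, pr[r, 2] %in% side))
  }
  X
}

# Unconstrained LS branch lengths, then both optimality criteria.
score <- function(side) {
  X <- quartet_design(side)
  y <- d[rownames(X)]
  v <- drop(solve(crossprod(X), crossprod(X, y)))
  names(v) <- colnames(X)
  c(Q = sum((y - drop(X %*% v))^2),  # LS criterion
    length = sum(v),                 # ME criterion
    internal = v[["internal"]])
}

splits <- list("AB|CD" = c("A", "B"), "AC|BD" = c("A", "C"), "AD|BC" = c("A", "D"))
res <- sapply(splits, score)
print(res)
names(which.min(res["Q", ]))       # topology preferred by LS
names(which.min(res["length", ]))  # topology preferred by ME
```

With these made-up distances the two criteria pick different topologies, and the topology preferred by unconstrained LS has a negative internal branch, which is the kind of case where the non-negativity constraint matters.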

Hi Liam, I believe at least in this case they ar...

2011-03-30T12:10:23.739-04:00

Hi Liam,

I believe at least in this case they are the same. I tried it a few times, and once I finally managed to put the tree, the design matrix and the distance matrix in the same order, it seemed to be the case. I find there is always a bit of confusion around minimum evolution: minimum evolution as an optimality criterion, and as an algorithm (heuristic) like fastME.
Also, fastME mostly comes as the weighted version, which is also very easy to implement; the weights are 2^(-rowSums(X)), I believe.
Anyway, with the code above it should now be easy to verify numerically whether the edge lengths of fastme.ols and LS are indeed identical. So LS is not only for fun, but maybe also useful ;)

Klaus

Hi Klaus. Thanks for the many useful suggestions....

2011-03-30T00:12:10.175-04:00

Hi Klaus.

Thanks for the many useful suggestions. At some point I will try to add these to the function and see what happens.

I believe, if I'm not wrong, that fastme.ols() optimizes under the minimum evolution criterion using the least squares branch lengths, which is different from optimizing under the least squares criterion using the same branch lengths. See Felsenstein (2004; pp. 148-161) for a description of this distinction.

- Liam

Hi Liam, it seems you may have missed that there...

2011-03-29T15:37:03.297-04:00

Hi Liam,

it seems you may have missed that the functions fastme.ols and fastme.bal are already in ape. These functions can even do SPR moves, but the names are not intuitive.

Here are a few tricks to speed up your LS code. If you are lazy and do not want to implement the code of Joe or the code from Bryant and Waddell, you can just make use of the fact that X is highly sparse:
library(Matrix)
and
X <- Matrix(X)
somewhere in the code will already speed up your code and save memory, especially for bigger matrices. The Matrix package also contains a vignette (in R, type vignette("Comparisons")), which tells you everything one wants to know about how to compute LS fast, e.g.:
v<-solve(crossprod(X), crossprod(X,colD))
is much faster than
v<-solve(t(X)%*%X)%*%t(X)%*%colD
a trick everybody should know.
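
A minimal sketch to check that the two forms give the same answer, using only base R. The random 0/1 matrix is just a stand-in for a tree design matrix, and the sizes n and p are arbitrary assumptions. No timings are claimed here; wrap either line in system.time() to compare on your own machine.

```r
set.seed(1)
n <- 200  # number of pairwise distances (arbitrary for this sketch)
p <- 20   # number of edges (arbitrary for this sketch)

# Random 0/1 matrix standing in for a design matrix; not a real tree.
X <- matrix(rbinom(n * p, 1, 0.3), n, p)
colD <- drop(X %*% runif(p)) + rnorm(n, sd = 0.1)

# Explicit inverse followed by matrix products.
v_slow <- solve(t(X) %*% X) %*% t(X) %*% colD

# Solve the normal equations directly, without forming the inverse.
v_fast <- solve(crossprod(X), crossprod(X, colD))

all.equal(v_slow, v_fast)
```

The second form avoids computing the full inverse and lets crossprod() form t(X) %*% X without an explicit transpose.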
There is also a function designTree in phangorn which does pretty much the same as phyloDesign and maybe is a bit faster. I will change it to the Matrix class in one of the next releases.
Speeding up the tree search on the other hand is tricky. The most expensive operation is likely the inversion of the matrix (t(X)%*%X). One could possibly use a partial inversion of this matrix as NNI moves only affect some blocks of this matrix.

You may also be interested in this small function to compute non-negative least-squares I wrote recently, following a discussion on R-phylo-sig:
#
# tree is a tree of class "phylo"
# dm a distance matrix
#
library(phangorn)  # designTree()
library(quadprog)  # solve.QP()
nnls.tree <- function(tree, dm){
    dm <- as.matrix(dm)
    labels <- tree$tip.label
    # reorder the distances to match the tip order of the tree
    dm <- dm[labels, labels]
    y <- dm[lower.tri(dm)]
    X <- designTree(tree, method = "unrooted")
    # ordinary least squares first; keep it if no edge length is negative
    betahat <- lm.fit(X, y)$coefficients
    if(!any(betahat < 0)){
        tree$edge.length[] <- betahat
        return(tree)
    }
    # otherwise minimise the same criterion subject to betahat >= 0
    n <- dim(X)[2]
    Dmat <- crossprod(X)
    dvec <- crossprod(X, y)
    Amat <- diag(n)
    betahat <- solve.QP(Dmat, dvec, Amat)$solution
    tree$edge.length[] <- betahat
    tree
}
I will also include it in the next phangorn version.
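
For readers without phangorn or quadprog at hand, the same idea can be tried on a hand-built quartet. This is only a sketch: the design matrix is written out by hand for the topology ((A,B),(C,D)), the distances are hypothetical, and the constrained fit uses optim() with method "L-BFGS-B" and a lower bound of zero, which is a swapped-in substitute for solve.QP rather than the method used in the function above.

```r
# Design matrix for ((A,B),(C,D)): four terminal edges plus one internal edge.
X <- rbind(AB = c(1, 1, 0, 0, 0),
           AC = c(1, 0, 1, 0, 1),
           AD = c(1, 0, 0, 1, 1),
           BC = c(0, 1, 1, 0, 1),
           BD = c(0, 1, 0, 1, 1),
           CD = c(0, 0, 1, 1, 0))
colnames(X) <- c("A", "B", "C", "D", "internal")

# Hypothetical distances that fit ((A,C),(B,D)) better than ((A,B),(C,D)).
y <- c(AB = 3, AC = 2, AD = 3, BC = 3, BD = 2, CD = 3)

# Unconstrained least squares: the internal edge comes out negative here.
ols <- lm.fit(X, y)$coefficients

# Same criterion with all edge lengths bounded below by zero.
rss  <- function(b) sum((y - drop(X %*% b))^2)
grad <- function(b) as.vector(-2 * crossprod(X, y - drop(X %*% b)))
fit  <- optim(pmax(ols, 0), rss, grad, method = "L-BFGS-B", lower = 0)
nn   <- fit$par

print(rbind(ols = ols, nonneg = nn))
```

The constrained fit cannot have a smaller residual sum of squares than the unconstrained one; what it buys is edge lengths that are all interpretable.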

Regards,
Klaus

Hi Joe. I have read your paper but not the other ...

2011-03-26T14:37:49.718-04:00

Hi Joe.

I have read your paper but not the other one.

In my class I programmed the "hard" parts ahead of time, and then we learned how to create a custom function in R to do inference - so my goal was indeed simplicity.

Unfortunately, our function didn't work so I had to spend an hour or two after class debugging!

Even though my function is slow, if you give it the NJ tree as a starting tree it will produce a sensible result in reasonable order. (For instance, in the mammal tree the unweighted LS tree is only one NNI away from the NJ phylogeny.) Adding weights (e.g., 1/D[i,j]^2 as in Fitch & Margoliash 1967) or changing to GLS would also be pretty straightforward.
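
As a rough sketch of how little changes when weights are added: with a design matrix X and a vector of distances y, weighted least squares only needs a weight per distance. The quartet design matrix and distances below are hypothetical, and the exponent pow is an assumption for this example (2 gives inverse-squared-distance weights; set it to 1 or 0 for other schemes).

```r
# Hand-built design matrix for the quartet ((A,B),(C,D)).
X <- rbind(AB = c(1, 1, 0, 0, 0),
           AC = c(1, 0, 1, 0, 1),
           AD = c(1, 0, 0, 1, 1),
           BC = c(0, 1, 1, 0, 1),
           BD = c(0, 1, 0, 1, 1),
           CD = c(0, 0, 1, 1, 0))
colnames(X) <- c("A", "B", "C", "D", "internal")

# Hypothetical distances, not real data.
y <- c(AB = 0.30, AC = 0.52, AD = 0.61, BC = 0.55, BD = 0.58, CD = 0.25)

pow <- 2        # assumption: weights 1/D^pow
w <- 1 / y^pow

# Weighted LS via lm.wfit(), which minimises sum(w * residuals^2).
v_w <- lm.wfit(X, y, w)$coefficients

# The same estimate from the weighted normal equations, as a cross-check.
v_chk <- solve(t(X) %*% (w * X), t(X) %*% (w * y))

# Unweighted fit for comparison.
v_ols <- lm.fit(X, y)$coefficients
print(rbind(weighted = v_w, unweighted = v_ols))
```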

Thanks for the comment!

There are faster algorithms for least squares phyl...

2011-03-26T09:58:05.206-04:00

There are faster algorithms for least squares phylogeny in two papers:

(1) My own iterative LS algorithm in my 1997 paper in Systematic Biology. It is fairly fast and has the convenient property of allowing us easily to keep branch lengths from going negative, if that's what we want. It involves a whole scheme for "pruning" distances on a tree, and can reuse parts of the calculation if only a small change of the tree is made (however, in principle we must re-evaluate every branch length when even one branch's length is changed).

(2) Some algorithms by Bryant and Waddell in Molecular Biology and Evolution in 1998. These are even faster.

Both of these are much more complex to implement, so for a teaching example I sympathize with your choices here.