Friday, March 25, 2011

For fun: least squares phylogeny estimation

For my class this semester on "Applied Phylogenetic Methods" (which, based on a chat today with Ron Etter's postdoc Rob Jennings, I've now concluded should more properly be called "Phylogenetics Demystified"), I recently programmed an implementation of phylogeny estimation using the unweighted least squares method (on my R-phylogenetics page, code here [update, v0.2]) - originally developed by Cavalli-Sforza & Edwards (1967), and described in detail by Felsenstein (2004, which is also our textbook for the class). Felsenstein (2004) calls the least squares methods the statistically "best-justified" distance matrix methods (p. 148). To my knowledge there is no direct implementation of least squares in any R package [although it should be possible to call the fastME program from within R and minimum evolution uses the least squares branch lengths].

As the title of this post suggests, this implementation is really just for fun, and should not be considered a serious function for phylogeny inference! This is because it is really not programmed efficiently and is thus very, very slow for more than a modest number of taxa. It also uses only nearest neighbor interchange to search treespace (capitalizing on the nni() function of the very useful {phangorn} package) thus will tend to get stuck on "islands" of suboptimal trees (not connected to better trees by a single NNI) if the starting tree is not very good. Nonetheless, the function does appear to work. If we give a matrix containing the true evolutionary distances between species (i.e., the distances in the true tree), then the program will tend to find the correct topology and branch lengths given a reasonable starting tree.

The basic principle of least squares phylogeny estimation is straightforward. We would like to find a tree and set of branch lengths that minimizes the summed squared deviations between our observed distances (i.e., the set of distances that serve as input for our analysis) and the hypothesized distances (i.e., the patristic distances in our inferred tree).

So, in other words, our model is as follows: D = Xv+e; in which D is a vector of distances between species; X is a design matrix based on a hypothesized topology (to be discussed below); v is a vector of edge lengths; and e, of course, is our residual error.

Given a topology to solve for our estimated edge lengths we can just use our standard least squares estimating equation: v = (X'X)-1X'D. We can then calculate the sum of squares residual error (the quantity we want to minimize) for a given tree as: Q = (D-Xv)'(D-Xv).

So, there are two things we want to find. We want to find v that minimize Q conditioned a tree topology (given by X), which we can easily do using least squares estimation; but we also want to find the tree that minimizes Q given our least squares estimate of v.

We first need to compute X given a tree. I have not described the design matrix X to this point, but X is an N=n(n-1)/2 (for n species) by m=2n-3 (for a fully bifurcating unrooted tree) matrix. Here, you can probably see that N is just the number of distances between species; and m is the number of edges in the tree. We put a "1" in the Xij if the jth edge can be found on the path between the pair of species represented in the ith row of the matrix. To make this a little clearer, consider the unrooted, fully labeled tree at right and the design matrix calculated with my function phyloDesign() below:

> phyloDesign(tree)
    6,7 7,1 7,2 6,8 8,3 8,4 6,5
1,2   0   1   1   0   0   0   0
1,3   1   1   0   1   1   0   0
1,4   1   1   0   1   0   1   0
1,5   1   1   0   0   0   0   1
2,3   1   0   1   1   1   0   0
2,4   1   0   1   1   0   1   0
2,5   1   0   1   0   0   0   1
3,4   0   0   0   0   1   1   0
3,5   0   0   0   1   1   0   1
4,5   0   0   0   1   0   1   1

Now, with this design matrix we can compute our least squares branches and Q, and it is then only a matter of finding the tree (and thus the design matrix, X) that minimizes Q. To do this, I have used nni() from the {phangorn} package to design a greedy (but nonetheless remarkably slow) algorithm for finding the least squares tree. nni() returns all the one-step away NNI trees for a given tree. So, I just returned all these, compute X, v, and Q for each tree, pick the lowest one, and repeat until Q no longer improves from NNIs. Obviously, this algorithm will easily get stuck on low-lying "hills" in treespace. I'm not familiar with other more aggressive tree rearrangement functions available in R (for instance, TBR), but perhaps my readers are and if so, they can tell me!

To see how this works, let's take the distance matrix for 8 mammals given by Felsenstein (2004; p. 163), as follows:

> D.mammals
         dog bear raccoon weasel seal sea_lion cat monkey
dog        0   32      48     51   50       48  98    148
bear      32    0      26     34   29       33  84    136
raccoon   48   26       0     42   44       44  92    152
weasel    51   34      42      0   44       38  86    142
seal      50   29      44     44    0       24  89    142
sea_lion  48   33      44     38   24        0  90    142
cat       98   84      92     86   89       90   0    148
monkey   148  136     152    142  142      142 148      0

Now, let's load the function from source:

> source("")

This will actually load four functions:, ls.tree(), phyloDesign(), and compute.ancestor.nodes(). It will also load the packages {phangorn} (and, of course, {ape}) on first execution.

> lstree<,upgma(D.mammals))
[1] "1 set(s) of nearest neighbor interchanges. best Q so far = 143.854166666667"
[1] "2 set(s) of nearest neighbor interchanges. best Q so far = 98.8333333333332"
[1] "3 set(s) of nearest neighbor interchanges. best Q so far = 92.1166666666666"
best Q score of 92.1166666666666 found after 3 nearest neighber interchange(s).

Using the UPGMA tree as our starting tree. Then we can plot our tree and edge lengths:

> plot(midpoint(lstree)) # use midpoint rooting
> edgelabels(round(midpoint(lstree)$edge.length,2), adj=c(0.5,-0.2),frame="none")

which results in a tree that makes a fair bit of sense. Note that the zero length branch in this tree was actually negative in the LS tree, but my program sets negative branch lengths to zero by default (although this can be changed by the user).


  1. There are faster algorithms for least squares ohylogeny in two papers:

    (1) My own iterative LS algorithm in my 1997 paper in Systematic Biology. It is fairly fast and has the convenient property of allowing us easily to keep branch lengths from going negative, if that's what we want. It involves a whole scheme for "pruning" distances on a tree, and can reuse parts of the calculation if a only small change of the tree is made (however in principle we must re-evaluate every branch length when even one branch's length is changed).

    (2) Some algorithms by Bryant and Waddell in Molecular Biology and Evolution in 1998. These are even faster.

    Both of these are much more complex to implement, so for a teaching example I sympathize with your choices here.

  2. Hi Joe.

    I have read your paper but not the other one.

    In my class I programmed the "hard" parts ahead of time, and then we learned how to create a custom function in R to do inference - so my goal was indeed simplicity.

    Unfortunately, our function didn't work so I had to spend an hour or two after class debugging!

    Even though my function is slow, if you give it the NJ tree as a starting tree it will produce a sensible result in reasonable order. (For instance, in the mammal tree the unweighted LS tree is only one NNI away from the NJ phylogeny.) Adding weights (e.g., 1/D[i,j] as in Fitch & Margoliash 1967) or changing to GLS would also be pretty straightforward.

    Thanks for the comment!

  3. Hi Liam,

    it seems you may have missed that there are fastme.ols and fastme.bal functions are already in ape. These functions can do even SPR moves, but the names are not intuitive.

    To speed up your LS code a few tricks. If you are lazy and do not want to implement the code of Joe or the code from Bryant and Waddell, you can just make use of the fact that X is highly sparse:
    X <- Matrix(X)
    somewhere in the code will already speed up your code and saves memory, especially for bigger matrices. The Matrix package also contains a vignette (in R type vignette("Comparisons") ), which tells you everything one wants to know about how to compute LS fast, e.g.:
    v<-solve(crossprod(X), crossprod(X,colD))
    is much faster than
    a trick everybody should know.
    There is also a function designTree in phangorn which does pretty much the same as phyloDesign and maybe is a bit faster. I will change it to the Matrix class in one of the next releases.
    Speeding up the tree search on the other hand is tricky. The most expensive operation is likely the inversion of the matrix (t(X)%*%X). One could possibly use a partial inversion of this matrix as NNI moves only affect some blocks of this matrix.

    You may also be interested in this small function to compute non-negative least-squares I wrote recently, following a discussion on R-phylo-sig:
    # tree is a tree of class "phylo"
    # dm a distance matrix
    nnls.tree <- function(tree, dm){
    dm = as.matrix(dm)
    k = dim(dm)[1]
    labels = tree$tip
    dm = dm[labels,labels]
    y = dm[lower.tri(dm)]
    X = designTree(tree, method = "unrooted")
    betahat =,y)$coefficients
    tree$edge.length[] = betahat
    n = dim(X)[2]
    Dmat <- crossprod(X)
    dvec <- crossprod(X, y)
    Amat <- diag(n)
    betahat <- solve.QP(Dmat,dvec,Amat)$sol
    tree$edge.length[] = betahat
    I will also include it in the next phangorn version.


  4. Hi Klaus.

    Thanks for the many useful suggestions. At some point I will try to add these to the function and see what happens.

    I believe, if I'm not wrong, that fastme.ols() optimizes under the minimum evolution criterion using the least squares branch lengths; which is different than optimizing under the least squares criterion using same. See Felsenstein (2004; pp. 148-161) for a description of this distinction.

    - Liam

  5. Hi Liam,

    I believe at least in this case they are the same. I tried it a few times and when I finally managed to order the tree, the design and distance matrix it seemed to be the case. I found there is always a bit confusion around minimum evolution: minimum evolution as an optimality criterion and as an algorithm (heuristic) like fastME.
    Also fastME mostly comes as the weighted version - also very easy to implement - the weights are 2^(-rowSums(X)), I believe.
    Anyway, with the code above, it should be now easy to verify numerically, whether edge lengths of fastme.ols and LS are indeed identical. So LS is not only for fun, but maybe also useful ;)


  6. If I understand correctly, for a given tree topology the branch lengths are estimated by least squares for both LS and ME methods, but then the choice among topologies is made by the LS criterion in the first case, and by minimizing the sum of the branch lengths in the second. That might choose the same topology but is not guaranteed to do so. If it does come to the same topology as LS of course the branch lengths are the same. Note that in the ME case you do want to constrain branch lengths not to be negative.