Tuesday, April 12, 2011

Fixes and improvements to phyl.pca()

I have posted a couple of fixes and improvements to phyl.pca().

First, the original function had a small bookkeeping bug - that has been fixed.

However, more significantly, I made a number of improvements to the function that cut the run-time on a sample dataset by better than 1/2.

First, I added some front-end error checking that I have been trying to incorporate into some of my functions. Here, I check to see if the data (in Y) have rownames, and if the rownames match the tree tip labels. In addition, however, I also allow an incomplete match, so long as all names in Y are found in the tree. To accomplish this, I just retain or prune rows and columns from C<-vcv.phylo(tree) based on their presence or absence in Y. That is:

C<-vcv.phylo(tree)[rownames(Y),rownames(Y)]

Next, I tried to improve the computational performance of the function. I did this in a few different ways. I minimized the number of calls to the {ape} function vcv.phylo() - which can be quite slow. I did this by first computing C<-vcv.phylo(tree), and then passing C, not the phylogeny, to phylo.vcv() [my function to compute the evolutionary variance-covariance matrix among traits] and likelihood() [my function to compute the likelihood for method="lambda"].

Finally, and this had the biggest impact of all, I halved the number of times I had to compute the n*m × n*m Kronecker product of C & R<-phylo.vcv(...)$R; and, wherever it made sense, I switched out solve(A)%*%B in favor of solve(A,B) [as suggested by Vladimir Minin]. The number of times I could do this was fairly small, as I had previously discovered that although solve(C,X) is faster than solve(C)%*%X, as suggested by Vladimir, if we have to compute invC<-solve(C) anyway, then invC%*%X becomes faster than solve(C,X). That said, I was able to use his suggestion with the largest matrix in the analysis (that is, the aforementioned Kronecker product). This certainly helped a lot because for each call of solve(A,b) (instead of solve(A)%*%b) in our likelihood function, we save nearly 3/4 of a second.

The updated function (v0.2) is on my R-phylogenetics page (direct link to code here).

4 comments:

  1. I am glad I could help.

    I have a nagging feeling that some of your computations may be sped up even further if you use the Felsenstein's independent contrasts transformation, but I cannot put my finger on it.

    ReplyDelete
  2. Yes, I agree. I think that is true.

    The only issue is the likelihood optimization of lambda. I could circumvent the problem by using Luke Harmon et al.'s lambdaTree() function and then computing contrasts from the transformed tree; however lambdaTree() is not super fast.

    Nonetheless, as the number of taxa (n) or traits increases (m) [thus making the n*m × n*m Kronecker product matrix very large] the relative slowness of lambdaTree() would not really matter as matrix inversion will be very slow.

    ReplyDelete
  3. Hi,

    I think Vladimir is correct, one can use contrasts to fit the model and considerably speed computation. I have some code (too long to paste here) that fits lambda using this approach - more than happy to share it. Gavin Thomas also has implemented the same idea in his Motmot package.

    The idea is based on a wonderful paper by Felsenstein in the 1970s(!) in which he showed how to fit Brownian models without having to do the matrix inversion which would have been even more computationally prohibitive in those days.

    Best,

    Rob F

    ReplyDelete
  4. Hi Rob. Yes, definitely. My main concern was that transforming the tree using lambdaTree() is slow. It turns out (I figured out today) that this transformation can be done much more efficiently than is done by lambdaTree(). I will share code for this in a subsequent post.

    ReplyDelete