Wednesday, September 7, 2011

Phylogenetic canonical correlation analysis

I just posted a new function to do phylogenetic canonical correlation analysis. Canonical correlation is a procedure whereby, given two sets of variables (say, a set of Xs and a set of Y), we identify the orthogonal linear combinations of each that maximize the correlations between the sets. This type of analysis is most naturally used in an evolutionary study to analyze, say, a set of morphological variables and a set of environmental or ecological variables.

The phylogenetic version of this analysis takes the phylogeny and a (explicit or implicit) model of evolution into account to find the linear combination of Xs and Ys that maximizes the evolutionary correlations (that is, the inferred correlation of evolutionary changes) between the two sets (Revell & Harrison 2008).

The program is very simple. Direct link to the code is here. To use the function, first load the source:

> source("")
> result<-phyl.cca(tree,X,Y)

Here, tree is a phylogenetic tree; and X & Y are two data matrices containing values for one or multiple characters in columns and species in rows. Rows should be named by species.

The results are returned as a list with the following elements:

> result
[1] 0.3764753 0.1852836 0.1054606
[1,] 0.04497549 -0.09956576 -0.45926364
[2,] -0.18997199 0.46065246 -0.07810429
[3,] -0.42425815 -0.16063677 -0.18791902
[4,] 0.25374826 0.29822455 -0.06176255
[1,] -0.2704762 -0.3841450 0.1029158
[2,] -0.1048448 0.2502089 0.5860655
[3,] 0.3736474 -0.2580132 0.2743137
1 0.27821077 -0.33344726 0.94985154
2 -0.23088044 0.78905936 0.26050453
3 -1.44525534 -0.22803129 -0.64071476
1 0.55710619 -0.850905958 0.300282830
2 1.41482268 1.237829442 -0.446763906
3 -1.40453596 0.227361557 -0.964307876
[1] 8.9531203 2.0752824 0.5032912
[1] 0.7069293 0.9126462 0.7775202

Here, $cor is the set of canonical correlation; $xcoef & $ycoef are the canonical coefficients; $xscores & $yscores are the canonical scores, in terms of the original species; and $chisq & $p are Χ2 with corresponding p-values. The p-values are properly interpreted as the probability that the ith and all subsequent correlations are zero.

A few years ago I released a C program that does more or less the same thing; however there are a few differences.

1) My C program globally optimizes the λ parameter. I will add this to the present function promptly.

2) My C program first transforms the data into a phylogeny free space, and then computes the canonical correlations. This means that, although the correlations are the same in both methods, the scores are no longer in terms of species and will be different than in this function.

It is also possible to conduct a non-phylogenetic canonical correlation analysis with this function by giving phyl.cca() a star phylogeny with unit length. To do this with our previous example, we can just do the following:

> result<-phyl.cca(starTree(species=tree$tip.label, branch.lengths=rep(1,length(tree$tip))),X,Y)

This should produce the same result as cc() in the "CCA" package. (Note, though, that for cc() the rows of X & Y must be in the same order - they do not for phyl.cca(), but must contain row names.)

This function will be added to the next update of "phytools".


  1. Hi Liam,

    very nice your blog, I got introduced to many new concepts through it and phytools and of course many new difficulties..

    I came across your function “phyl.cca” while searching ways to account for species non-independance in multivariate analysis. I am surprised that I haven't been able to find many references of the phylogenetic comparative methods out of ecological studies, it seems a pretty standard problem to me, at least in comparative genomics that I am working on.

    I experimented with your function looking for associations among genomic data (for instance protein family presence/absence for various species, “phylogenetic profiles”), these practically being my dependent variables, and various morphological traits. I understand that probably canonical correlation is not suited to this kind of categorical variables, with both the matrices being binary and I got an error during the computation of canonical correlations in your function,
    Error in solve.default(SigXX)
    Lapack routine dgesv: system is exactly singular: U[782,782] = 0

    which I guess has to do with the 0,1 nature of my tables.

    My question is, do you see any straight forward way to adapt and generalise your implementation to other kind of multivariate techniques (co-inertia analysis for instance or logistic regression)? Could you point out related reads that could help?

    Thanks in advance, keep up the good comparative work!


  2. could you please provide the data arrangement of X and Y. I understand how to arrange X so far but how do you arrange Y. it is not clear.

    1. The function arguments X and Y should be two different numerical matrices with row names that match the species labels of the tree.

    2. Thank you very much for your reply.
      For example I have a tree, a plot data (I arrange as X, species vs. plot) and 19 climate variables of each plot (that I want to set as Y). But the problem is the species are repeating in some plots and can't have the unique value for each species. In this case how can I arrange for Y?


Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.