Wednesday, November 29, 2023

Workflow for phylogenetic canonical correlation analysis using phytools phyl.cca

A colleague recently asked me about the phytools function phyl.cca for phylogenetic canonical correlation analysis. Canonical correlation analysis is a technique whereby, given two different matrices of continuous variables (X and Y), one can identify a set of derived variables (the canonical variables), each a linear combination of X and Y, that have the maximum correlation. Each canonical variable is orthogonal to the preceding one. Accounting for phylogeny is done in the typical way such that the correlation to be maximized is the evolutionary correlation, and each canonical variable is evolutionarily orthogonal to the preceding one. Typically, we might use canonical correlation analysis to analyze a pair of data matrices of different types – for instance, in which X contained a set of environmental trait, and Y a set of morphological features (and so on).

phyl.cca actually replaces a much older stand-alone binary software (PCCA) that I described in a paper with Alexis Harrison way back in 2008. I offered to provide my colleague with a complete workflow illustrating the function. The dataset & tree that I’ll use for this come from the website of my recent book with Luke Harmon. I’ll use the files from Chapter 2 of the book BarbetTree.nex and Barbetdata.csv, which I have downloaded to my local working directory. These data are originally from an article by Gonzalez-Voyer et al. (2013) and consist of a phylogeny and trait dataset for Asian barbets (Megalaimidae).

After loading our packages, we can proceed to read in both files as follows.

## load packages
library(phytools)
library(geiger)
## read tree
barbet_tree<-read.nexus(file="BarbetTree.nex")
barbet_tree
## 
## Phylogenetic tree with 42 tips and 41 internal nodes.
## 
## Tip labels:
##   M_asiatica_asiatica, M_lineata_hodgsoni, M_australis_australis, M_haemacephala_intermedia, M_eximia, M_viridis, ...
## 
## Rooted; includes branch lengths.

(Here I used the function read.nexus from the ape package to read in the tree. Had it been in simple Newick format, I would have used read.tree instead.)

## read data
barbet_data<-read.csv(file="Barbetdata.csv",row.names=1)
head(barbet_data)
##                                          wing    Lnalt    patch   colour  Frequency      Length Lnote
## Calorhamphus_fuliginosus_fuliginosus 4.388257 5.298317 2.000000 1.666667  20.468894   3.1207387 0.260
## Calorhamphus_fuliginosus_hayi        4.427239 5.298317 2.000000 1.666667  22.483670   3.1371727 0.230
## M_armillaris                         4.532599 7.170120 6.333333 4.000000  -7.135924 -11.9001412 0.030
## M_asiatica_asiatica                  4.611152 6.802395 7.333333 5.000000 -10.153448   0.3671016 0.025
## M_asiatica_davisoni                  4.605170 7.003065 6.666667 3.333333  -9.700025   0.4492719 0.030
## M_australis_duvaucelli               4.282206 6.620073 9.000000 4.000000   1.206501  -2.5198843 0.030

As you might be able to guess by looking at it, these data consist of several variables for morphology (wing, patch, and colour), several variables for song (Frequency, Length, and Lnote), and one variable for ecology (Lnalt). What I’ll plan to do is subdivide our dataset into one matrix for morphology and a second for song, and then run phylogenetic canonical correlation analysis on these two matrices.

First, though, I need to do a little additional bookkeeping. I remember from our book that this barbet tree contains several species that are not represented in our the data. We’ll prune these taxa out before proceeding further.

chk<-name.check(barbet_tree,barbet_data)
summary(chk)
## 9 taxa are present in the tree but not the data:
##     M_asiatica_chersonesus,
##     M_australis_australis,
##     M_haemacephala_celestinoi,
##     M_haemacephala_delica,
##     M_haemacephala_intermedia,
##     M_haemacephala_rosea,
##     ....
## 
## To see complete list of mis-matched taxa, print object.
pruned.barbet_tree<-drop.tip(barbet_tree,chk$tree_not_data)
name.check(pruned.barbet_tree,barbet_data)
## [1] "OK"

Now let’s make our two matrices for the canonical analysis. For good measure, I’ll throw in the variable Lnalt into my “morphological” data matrix.

barbet_morph<-barbet_data[,c("wing","patch","colour","Lnalt")]
head(barbet_morph)
##                                          wing    patch   colour    Lnalt
## Calorhamphus_fuliginosus_fuliginosus 4.388257 2.000000 1.666667 5.298317
## Calorhamphus_fuliginosus_hayi        4.427239 2.000000 1.666667 5.298317
## M_armillaris                         4.532599 6.333333 4.000000 7.170120
## M_asiatica_asiatica                  4.611152 7.333333 5.000000 6.802395
## M_asiatica_davisoni                  4.605170 6.666667 3.333333 7.003065
## M_australis_duvaucelli               4.282206 9.000000 4.000000 6.620073
barbet_song<-barbet_data[,c("Frequency","Length","Lnote")]
head(barbet_song)
##                                       Frequency      Length Lnote
## Calorhamphus_fuliginosus_fuliginosus  20.468894   3.1207387 0.260
## Calorhamphus_fuliginosus_hayi         22.483670   3.1371727 0.230
## M_armillaris                          -7.135924 -11.9001412 0.030
## M_asiatica_asiatica                  -10.153448   0.3671016 0.025
## M_asiatica_davisoni                   -9.700025   0.4492719 0.030
## M_australis_duvaucelli                 1.206501  -2.5198843 0.030

Now, we’re ready for our analysis using phyl.cca. This works as follows.

barbet_cca<-phyl.cca(pruned.barbet_tree,barbet_morph,
  barbet_song)
barbet_cca
## 
## Object of class "phyl.cca" from a phylogenetic canonical
##    correlation analysis.
## 
## Summary of results:
##      correlation       X^2  P-value
## CC:1    0.576934 19.512138 0.076896
## CC:2    0.424662  8.179298 0.225260
## CC:3    0.298394  2.611137 0.271018
## 
## Assumed or estimated value of lambda:
##    lambda      logL 
##    1.0000 -256.5318 
## 
## Canonical x coefficients:
##              CA1        CA2       CA3
## wing   15.777525 -18.997694 -4.784766
## patch  -0.512648   0.048314 -0.331544
## colour  0.480505  -0.772164 -1.282279
## Lnalt   3.927334   1.871524 -2.759275
## 
## Canonical y coefficients:
##                 CA1       CA2        CA3
## Frequency -0.118156  0.580255  -0.096996
## Length     0.213158  0.139234   0.646897
## Lnote     31.778271 -5.674240 -21.987228

That’s really all there is to it! We can see from the results that perhaps our first canonical correlation is marginally significant – but none of the others. If we had included the argument fixed=FALSE in our phyl.cca call we would have also estimated a value for Pagel’s \(\lambda\) which would have been used in subsequent calculations. In this case, I know the ML value of \(\lambda\) is very close to 1.0, so it doesn’t make any difference.

For fun, here is a phylo.heatmap plot of all of our variables.

phylo.heatmap(pruned.barbet_tree,barbet_data,fsize=c(0.6,0.7,0.9),
  standardize=TRUE,pts=FALSE,colors=viridisLite::viridis(n=20),
  split=c(0.6,0.4))

plot of chunk unnamed-chunk-10

No comments:

Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.