Tuesday, November 6, 2018

Computing phylogenetic PCA scores for individual data, when PCs were extracted from the correlation matrix & with Pagel's λ

A friend & colleague recently contacted me about computing principal components scores for individual data from a phylogenetic principal components rotation conducted using means. Not only that, but he wanted to do this based on the correlation matrix (rather than the covariance matrix), and using an optimized Pagel's λ transformation.

Firstly, if we were just using the covariance matrix, this would be relatively straightforward. We would first compute the PCs and then we could just multiply the (recentered) individual data by the matrix of eigenvectors from our phylogenetic PCA to get the scores. The only complication arises when we conduct the recentering, because we need to recenter using the vector of phylogenetic means from the species data (i.e., the reconstructed values for each trait at the root of the tree), rather than using a simple arithmetic mean of the species values for each trait.

If we want to use the correlation matrix for our PCA then things get a bit more complicated. This is because before computing our scores for individuals in the rotated space, we need to transform each trait to have a (phylogenetic) variance of 1.0, as well as recentering on the reconstructed ancestral states for the trait. This we can do by computing the (phylogenetic) variance of each trait, and then dividing the individual values by the square root of this variance.

For the following, Xm contains the species means data, Xi the data for individuals, and the rest should be fairly obvious!

First, I'm going to compute the scores for the species means data using phyl.pca and then manually (to show that they're the same), and then I will use the same algorithm to compute the scores for individuals.

## load library
library(phytools)
## visualize the data
dotTree(tree,Xm)

plot of chunk unnamed-chunk-1

## compute PCA
pca<-phyl.pca(tree,Xm,method="lambda",mode="corr")
pca
## Phylogenetic pca
## Standard deviations:
##       PC1       PC2       PC3       PC4       PC5 
## 1.5425533 1.1834761 0.8015111 0.6782314 0.3427766 
## Loads:
##                PC1        PC2         PC3         PC4         PC5
## [1,]  8.605186e-01 -0.1021593  0.04453666  0.47415186 -0.14922368
## [2,] -8.475339e-01 -0.1633525 -0.11016975  0.47117137  0.14443841
## [3,] -9.399335e-01  0.2144096  0.01911199  0.01798584 -0.26431948
## [4,] -7.748565e-06 -0.8393609 -0.52923628 -0.10504365 -0.06594048
## [5,] -1.928464e-01 -0.7829413  0.58978198 -0.04263434 -0.01235669
## lambda:
## [1] 0.6640637
## obtain among species & among trait phylogenetic covariance matrices
## with our fitted lambda
obj<-phyl.vcv(Xm[tree$tip.label,],vcv(tree),pca$lambda)
V<-obj$R
C<-obj$C
n<-nrow(Xm)
m<-ncol(Xm)
## transform the data such that the columns have phylogenetic 
## variances of 1.0
Y<-Xm/matrix(rep(sqrt(diag(V)),n),n,m,byrow=TRUE)
## check (should be 1.0s)
diag(phyl.vcv(as.matrix(Y)[tree$tip.label,],C,1)$R)
## [1] 1 1 1 1 1
## invert among species covariance matrix
invC<-solve(C)[rownames(Y),rownames(Y)]
## compute ancestral states to recenter
a<-matrix(colSums(invC%*%as.matrix(Y))/sum(invC),m,1)
A<-matrix(rep(a,n),n,m,byrow=TRUE)
Y<-Y-A
## check (should be 0.0s - to numerical precision)
phyl.vcv(Y[tree$tip.label,],C,1)$alpha[,1]
## [1]  2.234332e-18 -2.968470e-16  4.187777e-16 -3.115298e-16 -3.855819e-16
## as a check, compute scores on the original species means data
Sm<-Y%*%pca$Evec
plot(pca$S,Sm,pch=21,bg=make.transparent("blue",0.25),
    cex=1.5,xlab="PC scores from phyl.pca",
    ylab="PC scores from this exercise")

plot of chunk unnamed-chunk-2

OK, that seemed to work. Now all we need to do to compute our scores for individuals is perform exactly the same transformation of the individual data, and then rotate it using the eigenvectors of our among-species PCA:

## compute scores using the phyl.pca rotation on individual data
n<-nrow(Xi)
Yi<-Xi/matrix(rep(sqrt(diag(V)),n),n,m,byrow=TRUE)
A<-matrix(rep(a,n),n,m,byrow=TRUE)
Yi<-as.matrix(Yi)-A
Si<-Yi%*%pca$Evec
## done

For the above demo I didn't use my colleagues' data, of course. The data that I used were simulated as follows:

library(clusterGeneration)
tree<-pbtree(n=26,scale=1,tip.label=sample(LETTERS))
k<-5
V<-genPositiveDefMat(k,covMethod="unifcorrmat")$Sigma
X<-sim.corrs(phytools:::lambdaTree(tree,0.7),V)[LETTERS,]
N<-sample(1:5,nrow(X),replace=TRUE)
Xi<-apply(X,2,sampleFrom,xvar=0.2,n=N)
spp<-as.factor(rownames(Xi))
rownames(Xi)<-paste(rownames(Xi),unlist(sapply(N,function(n) 1:n)),
    sep="")
foo<-function(x,spp){
    coef<-coefficients(lm(x~spp))
    setNames(coef[1]+c(0,coef[2:length(coef)]),
        levels(spp))
}
Xm<-apply(Xi,2,foo,spp=spp)

I'll leave it to the reader to figure out how that worked!