This is working towards something else, but I thought I’d post a minimal function to do phylogenetic PCA (following Revell 2009) on the variance-covariance matrix using phylogenetically independent contrasts.
First, here’s the function.
phyl_pca<-function(tree,X,model="BM"){
pX<-apply(X,2,pic,phy=tree)
vcv<-t(pX)%*%pX/(Ntip(tree)-1)
a<-apply(X,2,function(x,tree) rep(ace(x,tree,
method="pic")$ace[1],Ntip(tree)),tree=tree)
eig<-eigen(vcv)
Eval<-diag(eig$values)
colnames(Eval)<-rownames(Eval)<-paste("PC",1:nrow(Eval),sep="")
Evec<-eig$vectors
rownames(Evec)<-colnames(X)
colnames(Evec)<-colnames(Eval)
S<-(X-a)%*%Evec
pic_corr<-function(x,y,tree){
px<-pic(x,tree)
py<-pic(y,tree)
mean(px*py)/sqrt(mean(px^2)*mean(py^2))
}
L<-apply(S,2,function(x,y,tree) apply(y,2,pic_corr,
x=x,tree=tree),y=X,tree=tree)
dimnames(L)<-dimnames(Evec)
object<-list(Eval=Eval,Evec=Evec,S=S,L=L,V=vcv,
a=a[1,,drop=FALSE],mode="cov",call=match.call())
class(object)<-"phyl.pca"
object
}
Now, let’s use simulated data to establish that it gives the same result as phyl.pca
in phytools.
library(phytools)
tree<-pbtree(n=26,tip.label=LETTERS)
X<-fastBM(tree,nsim=5)
ppca_old<-phyl.pca(tree,X)
ppca_old
## Phylogenetic pca
## Standard deviations:
## PC1 PC2 PC3 PC4 PC5
## 1.30539 1.14825 0.81916 0.69441 0.54724
## Loads:
## PC1 PC2 PC3 PC4 PC5
## [1,] -0.447049 -0.647115 -0.13345 0.517316 -0.30978
## [2,] -0.188977 -0.422767 0.86683 0.081185 0.16607
## [3,] -0.339527 0.870107 0.24518 0.177473 -0.18981
## [4,] -0.024847 -0.337630 0.25756 -0.684094 -0.59251
## [5,] 0.972326 0.030731 0.11605 0.168092 -0.10916
ppca_new<-phyl_pca(tree,X)
ppca_new
## Phylogenetic pca
## Standard deviations:
## PC1 PC2 PC3 PC4 PC5
## 1.30539 1.14825 0.81916 0.69441 0.54724
## Loads:
## PC1 PC2 PC3 PC4 PC5
## [1,] -0.447049 -0.647115 -0.13345 0.517316 -0.30978
## [2,] -0.188977 -0.422767 0.86683 0.081185 0.16607
## [3,] -0.339527 0.870107 0.24518 0.177473 -0.18981
## [4,] -0.024847 -0.337630 0.25756 -0.684094 -0.59251
## [5,] 0.972326 0.030731 0.11605 0.168092 -0.10916
Straight away, we see that we’re getting the same loadings (evoluationary correlations between our five PCs and the original traits). If we inspected the scores
head(scores(ppca_old))
## PC1 PC2 PC3 PC4 PC5
## A -0.94060 0.64396 1.248968 0.864546 0.62631
## B -0.60360 1.64493 0.917616 0.851302 -0.15091
## C -1.70219 0.91533 1.900368 -0.162792 0.29883
## D -0.78124 1.51660 -0.560986 0.011550 -1.10221
## E -1.18014 2.19275 -0.055891 -0.020694 -0.71089
## F -1.89954 3.63193 -0.560573 -0.059178 0.51366
head(scores(ppca_new))
## PC1 PC2 PC3 PC4 PC5
## A -0.94060 0.64396 1.248968 0.864546 0.62631
## B -0.60360 1.64493 0.917616 0.851302 -0.15091
## C -1.70219 0.91533 1.900368 -0.162792 0.29883
## D -0.78124 1.51660 -0.560986 0.011550 -1.10221
## E -1.18014 2.19275 -0.055891 -0.020694 -0.71089
## F -1.89954 3.63193 -0.560573 -0.059178 0.51366
We see the same thing.
Our plotting methods work too.
par(mfrow=c(1,2))
plot(ppca_new)
biplot(ppca_new)
So what’s the advantage of using contrasts – if we just get the same result? Well, for one contrasts are much faster.
This becomes evident when we simulate a much larger tree & more characters.
tree<-pbtree(n=2000)
X<-fastBM(tree,nsim=10)
system.time(ppca1<-phyl.pca(tree,X))
## user system elapsed
## 1.92 0.03 5.25
system.time(ppca2<-phyl_pca(tree,X))
## user system elapsed
## 0.05 0.00 0.09
Are the results the same? (I’ll change options()$digits
here just so that we can see all our loadings from each analysis in one blog column. I promise I’m not hiding any differences!)
options(digits=2)
ppca1
## Phylogenetic pca
## Standard deviations:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## 1.05 1.03 1.01 1.00 0.98 0.98 0.97 0.96 0.96 0.94
## Loads:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## [1,] 0.42 0.03 0.146 0.35447 0.492 0.167 0.208 0.048 -0.25 -0.547
## [2,] 0.27 0.37 0.409 -0.37599 -0.284 0.495 0.374 -0.093 -0.02 0.102
## [3,] -0.17 0.63 -0.016 -0.00034 -0.085 0.074 -0.342 0.391 0.38 -0.389
## [4,] -0.23 0.53 0.033 -0.30188 0.422 -0.286 -0.026 0.040 -0.52 0.212
## [5,] -0.32 0.20 -0.246 0.17329 0.482 0.272 0.309 -0.330 0.46 0.208
## [6,] 0.18 -0.10 0.076 -0.53549 0.220 -0.061 -0.407 -0.570 0.18 -0.293
## [7,] -0.20 -0.25 -0.199 -0.52941 0.059 -0.223 0.549 0.303 0.14 -0.334
## [8,] -0.31 -0.36 -0.062 -0.22299 0.226 0.669 -0.298 0.295 -0.21 0.038
## [9,] 0.70 0.08 -0.525 -0.17564 0.150 0.034 -0.081 0.263 0.12 0.284
## [10,] 0.11 -0.20 0.676 -0.02729 0.358 -0.196 -0.060 0.330 0.37 0.280
ppca2
## Phylogenetic pca
## Standard deviations:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## 1.05 1.03 1.01 1.00 0.98 0.98 0.97 0.96 0.96 0.94
## Loads:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## [1,] -0.42 0.03 -0.146 0.35447 0.492 0.167 -0.208 0.048 -0.25 0.547
## [2,] -0.27 0.37 -0.409 -0.37599 -0.284 0.495 -0.374 -0.093 -0.02 -0.102
## [3,] 0.17 0.63 0.016 -0.00034 -0.085 0.074 0.342 0.391 0.38 0.389
## [4,] 0.23 0.53 -0.033 -0.30188 0.422 -0.286 0.026 0.040 -0.52 -0.212
## [5,] 0.32 0.20 0.246 0.17329 0.482 0.272 -0.309 -0.330 0.46 -0.208
## [6,] -0.18 -0.10 -0.076 -0.53549 0.220 -0.061 0.407 -0.570 0.18 0.293
## [7,] 0.20 -0.25 0.199 -0.52941 0.059 -0.223 -0.549 0.303 0.14 0.334
## [8,] 0.31 -0.36 0.062 -0.22299 0.226 0.669 0.298 0.295 -0.21 -0.038
## [9,] -0.70 0.08 0.525 -0.17564 0.150 0.034 0.081 0.263 0.12 -0.284
## [10,] -0.11 -0.20 -0.676 -0.02729 0.358 -0.196 0.060 0.330 0.37 -0.280
Neat. How big can we go?
tree<-pbtree(n=10000)
X<-fastBM(tree,nsim=65)
system.time(ppca<-phyl_pca(tree,X))
## user system elapsed
## 10.28 0.16 20.19
biplot(ppca)
It works! More later….
Very interesting! Just a fast question, does similar approaches also apply for a phylogenetic Partial Least Squares regression ? Can I simply perform a normal Partial Least Squares on the matrix of PICs ?
ReplyDeleteThanks a lot,
Domenico