Friday, August 5, 2016

Update to phyl.pca for more columns than rows in input matrix

In response to a user request I have updated the phytools function phyl.pca to permit more columns (variables) than rows (species) in the input data matrix.

This is because though (for centred principal components analysis) it is one will always explain 100% of the variance with no more than N-1 principal components, it is possible to extract up to N-1 principal components for cases in which the input data matrix has more than N-1 columns.

The updates are not complicated - mostly I just compute min(n-1,m), for n species and m variables and then tell R to return only up to this quantity of principal components.

I also added an S3 plot method for objects of class "phyl.pca" (to existing biplot, print, and summary methods), which just returns a screeplot.

Here is an example in which I have a 10 taxon tree but 20 variables:

library(phytools) ## update from GitHub
packageVersion("phytools")
## [1] '0.5.44'
tree
## 
## Phylogenetic tree with 10 tips and 9 internal nodes.
## 
## Tip labels:
##  A, B, C, D, E, F, ...
## 
## Rooted; includes branch lengths.
X
##          [,1]       [,2]        [,3]        [,4]      [,5]       [,6]
## A 2.871213715  4.6747464 -0.40290368 -0.08741276  4.704137 -4.2622250
## B 1.042267133  3.8249301  0.65568345 -0.02805061  2.370947 -5.8588521
## C 0.004174414 -0.2874374  0.54262328  3.19664103  9.271929 -0.3564986
## D 2.976758314  1.8380804 -1.68282413  0.93237549  2.420743  2.5573102
## E 5.634111352 -4.7423357 -0.31085428 -2.50014841  5.786365 -0.2223702
## F 3.768230129 -4.4532704 -0.04018276 -1.82465733  5.547573 -0.2856634
## G 7.784903402 -0.7683049 -1.17327169 -0.34961072  4.334423  1.1144112
## H 1.643271999 -0.2554700 -0.49057157 -2.25651725  8.497341 -0.2871782
## I 3.923290527  0.2753995 -0.32539131  0.34383657  8.186258  0.8573116
## J 1.454268527 -6.0386162 -1.59563258 -2.28504513 -7.585861  3.1770481
##         [,7]       [,8]       [,9]     [,10]      [,11]      [,12]
## A  2.2477129 -1.6305340  1.8336705 -3.740515 -0.4025394  1.6017655
## B  2.1139692 -2.0213911  1.5040234 -2.910061 -0.8443350  3.1554476
## C  0.4520337 -1.0324254 -1.4895363  2.943384  3.6746002  2.1617860
## D  3.6463932 -1.7557318  0.8743329  1.663234 -1.3072798 -1.6035540
## E  0.3164341  6.8518210 -4.5161806  2.880459  3.3953585 -1.3370474
## F -1.3811630  6.7095858 -3.5541550  1.987218  2.2563118 -1.1472472
## G -0.7910001  3.7591620 -2.5568415  2.714351  2.4158234  1.3566271
## H -0.6734680 -0.4502042 -1.4317151  4.383923  4.1363432  0.9287238
## I -1.7199976  2.4663262 -2.0276715  3.273081  5.2843480  1.9080612
## J -2.3350235  1.0385213 -0.2371657 -1.755852  0.9089970 -6.8150985
##       [,13]      [,14]      [,15]      [,16]      [,17]     [,18]
## A -5.701040  7.3491364  4.6841996 -1.3305419  3.0704336 -5.471046
## B -9.124058  6.5838899  5.2174013 -2.7701790  1.1816007 -7.078690
## C -1.641395  0.7479561  4.1746009  0.3383891  0.5286383 -1.133129
## D -5.595636  4.8799468  2.6389452  1.0069082 -1.5137721  2.278655
## E -1.711472  7.5584553 -1.4630973  2.1636300  2.1653965  2.779019
## F -1.692711  4.9012769 -0.5412248  1.4585657  2.0301627  3.025220
## G -5.444037  9.6659874 -1.1024915  3.2488660  3.2064625  3.391553
## H -6.614349  2.8992088  1.5448734  2.0531905  1.9588168  1.510531
## I -9.444524  5.4065346  1.6282638  1.8551760  1.2510615  3.383527
## J  1.752094 -1.6821038 -2.3981097 -0.1649102 -2.1881646  4.575223
##        [,19]     [,20]
## A -2.8584588 -2.834560
## B -2.8935271 -2.134582
## C -0.8122405 -2.809604
## D  0.8635701  2.014487
## E -1.3748665 -3.749462
## F -0.4462016 -4.051456
## G -2.7122182 -3.882143
## H  1.4088890 -2.451189
## I -0.0321507 -2.765148
## J  2.3180704  4.245889
pca<-phyl.pca(tree,X)
summary(pca)
## Importance of components:
##                              PC1       PC2       PC3       PC4        PC5
## Standard deviation     6.1460367 5.5834580 4.6570712 3.5560098 3.03714064
## Proportion of Variance 0.3075852 0.2538527 0.1766042 0.1029677 0.07511125
## Cumulative Proportion  0.3075852 0.5614379 0.7380421 0.8410098 0.91612103
##                               PC6        PC7       PC8         PC9
## Standard deviation     2.10969194 1.62588269 1.4186935 1.092693580
## Proportion of Variance 0.03624209 0.02152552 0.0163890 0.009722366
## Cumulative Proportion  0.95236312 0.97388864 0.9902776 1.000000000
plot(pca)

plot of chunk unnamed-chunk-1

biplot(pca)

plot of chunk unnamed-chunk-1

Cool. You get the idea.

FYI, the data for this demo were simulated as follows:

library(clusterGeneration)
## simulate tree
tree<-pbtree(n=10,tip.label=LETTERS[1:10])
## simulate random positive definite covariance matrix
V<-genPositiveDefMat(20,covMethod="unifcorrmat")$Sigma
## simulate trait data
X<-sim.corrs(tree,vcv=V)

No comments:

Post a Comment