Monday, September 17, 2012

Update to phyl.pca so that "mode" (correlation vs. covariance) can be specified flexibly

A user recently complained the following about phytools function phyl.pca for phylogenetic principal components:

One little suggestion for protecting users from themselves might to throw up an error if the user misspells the mode parameters [Ed. note: which specifies whether to use the covariance or correlation matrix during analysis]. Currently, if you type:
   mode="cor"
It will work, and just run on the default covariance matrix. But "cor" is a misspelling of "corr", which will yield the desired analysis. It might be a common error, since cov and cor are both three letters. I just did this, but noticed the output was identical.


Good suggestion. I have now added a few lines of code to phyl.pca to fix this problem. Basically, the function now accepts any unambiguous shorting of covariance and correlation instead of just the three and four letter strings "cov" and "corr" respectively. In addition, if the match is imperfect, the function will return a message.

The updated function code is here; it is also in the latest non-CRAN version of phytools (v0.1-94, here) and will be in the next CRAN phytools release.

The way I achieved this is really simple. Basically, I used the base function strsplit to split the input mode into characters; then I matched the specified mode characters to strsplit("correlation",split="") or strsplit("covariance",split=""). Finally, if the match was imperfect (for instance, mode="corelation"), the function defaults to using the covariance matrix, but also spits an error.

Here's a demo:

> # first install new phytools
> install.packages("phytools_0.1-94.tar.gz",type="source", repos=NULL)
Installing package(s) into ‘C:/Users/liam.revell/Documents/R/win-library/2.14’ (as ‘lib’ is unspecified)
* installing *source* package 'phytools' ...
> # load phytools
> library(phytools)
> # for demo, simulate tree & data
> tree<-pbtree(n=40,scale=100)
> L<-matrix(rnorm(n=16),4,4); L[upper.tri(L,diag=F)]<-0
> V<-L%*%t(L) # covariance matrix for simulation
> X<-sim.corrs(tree,V)
> X
          [,1]        [,2]         [,3]       [,4]
t1   10.0145660  25.5508968  27.14419359 -16.670596
t2   -0.9504677  -1.1381218  23.31553654 -16.947086
t3         ....
> pca<-phyl.pca(tree,X,mode="covar")
> pca<-phyl.pca(tree,X,mode="co")
mode = 'co' not a valid option; setting mode = 'cov'
> pcaCorr<-phyl.pca(tree,X,mode="corelation")
mode = 'corelation' not a valid option; setting mode = 'cov'
> pcaCorr<-phyl.pca(tree,X,mode="correlation")
> pca
$Eval
        PC1      PC2      PC3          PC4
PC1 5.612471 0.000000 0.000000 0.000000e+00
PC2     ....
$Evec
           PC1        PC2       PC3         PC4
[1,] -0.2026469  0.2849556 0.1557432  0.92383906
[2,] -0.5305595  0.6617452 0.3666632 -0.38230631
[3,] -0.4710700 -0.6810988 0.5603986  0.01227893
[4,]  0.6749325  0.1303764 0.7261237 -0.01457757
$S           PC1        PC2         PC3          PC4
t1  -24.731226  -0.357717  13.5570211  0.038474777
t2        ....

> pcaCorr
$Eval
        PC1      PC2       PC3          PC4
PC1 2.363718 0.000000 0.0000000 0.000000e+00
PC2     ....
$Evec
           PC1        PC2       PC3         PC4
[1,]  0.5945738  0.3526378 0.1810304  0.69954028
[2,]  0.6090435  0.3035288 0.1661745 -0.71366821
[3,]  0.2450079 -0.7643208 0.5960410  0.02280308
[4,] -0.4642405  0.4464644 0.7644274 -0.02830375
$S
          PC1         PC2         PC3          PC4
t1   16.705476  -3.6663012   7.3051355  0.040978169
t2        ....


That's it. Thanks to the phytools user for his comment. Keep them coming!

8 comments:

  1. Hi Liam,
    I try to run the phyl.pca function, but I keep on getting this error message:

    "Error in solve.default(kronRC, y - D %*% a) :
    system is computationally singular: reciprocal condition number = 3.02577e-028"

    Can't figure out what is wrong with the data. Do you have any idea? It does run the demo, so I assume my input files are somehow wrong. I have a tree in nexus format, and a trait file input in csv format (no missing data), with names linked to the traits (data <- traits[,2:7]
    rownames(data) <- traits$name).
    Hope you can help?

    Renske


    ReplyDelete
    Replies
    1. Hi Renske.

      Does your try have terminal branch lengths that are zero? That's one thing that will cause singularity of vcv.phylo(tree), which means the phylogenetic PCA cannot be computed. If not that, maybe send me data & tree and I would be happy to investigate further.

      - Liam

      Delete
  2. Hi Liam,
    I'm having the same issue with "Error in solve.default(kronRC, y - D %*% a) :
    system is computationally singular: reciprocal condition number = 1.90653e-23"

    I don't know what is going on. I have run phyl.pca before with no problems. I went to the data and there is no zeros or missing information. Any idea of other possibilities for this error?

    Thanks

    Jose

    ReplyDelete
    Replies
    1. Are there zero-length terminal edges in the tree? This makes vcv(tree) singular & can cause this error. The other possibility is that the covariance matrix among traits is singular, which happens if there are more columns (variables) than rows (species) in your input data matrix; or if some columns are perfect linear combinations of other variables.

      In at least the first case, one workaround suggested to me Eric Stone at NCSU, is to use the pseudoinverse:

      require(corpcor)
      solve<-function(x) pseudoinverse(x)
      pca<-phyl.pca(...)

      For this to work, though, you need to load phyl.pca & all internally called functions from source. This can be done using:

      source("http://www.phytools.org/utilities/v4.6/utilities.R")
      source("http://www.phytools.org/phyl.pca/v0.5/phyl.pca.R")

      Let me know if this works.

      All the best, Liam

      Delete
    2. Thank you Liam, I'll try it and let you know.

      Cheers

      Jose

      Delete
    3. Hi Liam,
      Unfortunately I keep on getting the same type of error. I don't have any zero length edges in the tree or more columns than rows in the data matrix.
      The data are partial warps from a geometric morphometric analysis. If I use that matrix and do a normal PCA I get identical results as from relative warps. I have no idea what may be the problem with the phyl.pca. Would you be willing to have a look at my data, please. Maybe I'm missing something.

      Cheers

      Jose

      Delete
  3. Hi Liam,

    I have the same problem. I don't have zero length,but I have a lot more variables than species: dim(df)
    6 163

    Is there a workaround for this situation as well?

    All the best and thank you for providing this great package!
    Daniel

    ReplyDelete
  4. This comes over a year later, but i had similar issues with phyl.pca,and found that, like liam suggested, this issue can arise when some variables are perfect linear combinations of other variables - this may be true of some climatic variables.

    Trawling through some other forums I found a solution that by adding a random small number (0.00001 - 0.00009) to your data will remove the perfect linear relationship between variables without changing the underlying data such that the results will be altered significantly.

    The other option is removing these variables altogether if you know (or can find out) which ones they are.

    Alex

    ReplyDelete