Friday, April 1, 2011

Phylogenetic size correction using GLS

I just posted a new function to my R-phylogenetics page called phyl.resid() (direct link to code here). This function takes as input a vector or matrix with one or multiple morphological traits (Y), a vector or matrix with one column containing size (x), and a tree; and performs phylogenetic size correction following Revell (2009).

The function can fit the model using two different error structures. It can use the error structure implied by simple BM; but it can also use a "lambda" error structure based on Pagel (1999).

Here, I show how to generate simulated data, and then fit the model and calculate residuals using my function. First, using standard BM:

> # first load the function from source
> source("phyl.resid.R")
> # load {geiger} for simulation
> require(geiger)
> # simulate a stochastic b-d tree
> tree<-drop.tip(birthdeath.tree(b=1,d=0,taxa.stop=101),"101")
> # generate values for size (x) on the tree
> size<-sim.char(tree,as.matrix(1))[,,1]
> # generate residual error on the tree
> e<-sim.char(tree,as.matrix(diag(0.1,10)))[,,1]
> # generate data for morphology (Y)
> morph<-0.75*size+e+1.0 # arbitrary slope & intercept
> # now compute phylogenetic regressions & residuals
> res.bm<-phyl.resid(tree,x=size,Y=morph)
> res.bm
$beta
[,1] [,2] [,3] [,4] [,5] [,6]
1.0505289 0.7406708 0.9833486 0.6723376 1.1780240 1.1162929
x 0.7686907 0.6957405 0.7890109 0.7772312 0.7821898 0.7752434
[,7] [,8] [,9] [,10]
0.5867387 1.1465067 1.2188155 0.9111480
x 0.7466242 0.8108946 0.6988025 0.6798143
$resid
[,1] [,2] [,3] [,4]
35 -0.57578507 -0.2825177189 -0.955908341 -0.353829729
53 0.13403231 -0.4167233466 -1.244141868 -0.360423877
54 -0.28546926 -0.2337143482 -1.060995881 -0.106119966
45 0.20879004 -0.1023593678 -1.016254850 -0.044159533
...
> # now compute the correlations between e & res.bm$resid
> diag(cor(e,res.bm$resid))
[1] 0.9975066 0.9842922 0.9955430 0.9969516 0.9968132 0.9963083
[7] 0.9999522 0.9778480 0.9911174 0.9643098


So, in this case we see that our estimated residuals (res.bm$resid) are highly correlated with our generating residuals (e).

Let's try this again, but with a different randomly generated value of lambda for each column of e. We can use the same tree & x as before:

> # generate residual error on the tree using lambdaTree
> e<-matrix(NA,100,10,dimnames=list(names(size)))
> # generate random lambda
> lambda<-runif(10)
> # simulate residual error
> for(i in 1:10)
+ e[,i]<-sim.char(lambdaTree(tree,lambda=lambda[i]), as.matrix(0.1))[,,1]
> # generate data for morphology (Y)
> morph<-0.75*size+e+1.0 # arbitrary slope & intercept
> # now compute phylogenetic regressions & residuals
> res.lambda<-phyl.resid(tree,x=size,Y=morph,method="lambda")
> res.lambda
$beta
[,1] [,2] [,3] [,4] [,5] [,6]
1.1946700 1.1669820 0.9899934 0.6516128 0.8610812 1.1297510
x 0.7843517 0.8268043 0.6847976 0.7268403 0.7880103 0.7544444
[,7] [,8] [,9] [,10]
0.9381019 1.1502370 0.7875310 0.9959001
x 0.7389826 0.7931595 0.6583763 0.7477538

$lambda
[1] 8.435166e-01 6.610696e-05 7.463530e-01 9.672567e-01
[5] 7.654335e-01 8.839716e-01 2.376671e-01 6.870145e-01
[9] 1.653223e-01 6.610696e-05

$logL
[1] -61.32796 -88.37910 -72.52449 -40.10453 -58.06409 -56.09014
[7] -80.70959 -84.00074 -92.85250 -94.28537

$resid
[,1] [,2] [,3] [,4]
35 0.10484806 -0.302784537 -1.07117426 -0.21346466
53 -0.56102358 -0.647005948 -0.51969545 -0.01657716
54 -0.02236203 -0.294930056 -0.50011842 -0.13512110
...
> # compute the correlation between lambda & res.lambda$lambda
> cor(lambda,res.lambda$lambda)
[1] 0.9417568
> # now compute the correlations between e & res.lambda$resid
> diag(cor(e,res.lambda$resid))
[1] 0.9944585 0.9705136 0.9816635 0.9967442 0.9923672 0.9998890
[7] 0.9993074 0.9947153 0.9643409 0.9999766


From this example we can see that our generating and estimated lambda values are highly correlated (remember, there are only 10 of these for the 10 replicate simulations); and our generating and estimated residuals are also very highly correlated as before.

In theory, you should be able to do all of this using gls() in {nlme}; however for some reason I have not been able to always get this to return the correct result (as assessed by comparison to PIC regression) or sensible estimates of lambda. Maybe some of my readers have had similar experiences?

10 comments:

  1. Liam,
    Does your function account for the differences varinces produced by the differences in tip height in non-ultrametric paleo-trees? (As we've discussed on the list.)

    Tangentially, since your comment in Evolution on this topic came out, I've been wondering about the application of phylogenetic size-correction and PCA. How problematic is the issue for paleontologists, who generally lack reliable phylogenies but commonly apply PCA and size-correction to their data?
    -Dave

    ReplyDelete
  2. Hi Dave.

    Yes, no problem to have a non-ultrametric tree. One should be able to do this using gls(...,weights) as well.

    Regarding, your second item - this would seem to be fundamentally an empirical question. However, in my opinion I think it will depend on the purpose of the PCA. For instance, if you ignore phylogeny then the PCs will not be evolutionarily orthogonal; however if the purpose is merely data reduction, this might not be of great consequence.

    Thanks for the great comment.

    - Liam

    ReplyDelete
  3. Liam-
    Hmm. The most common uses for PCA are for measuring disparity or visualizing morphospace occupation. Might be a good modeling project for a junior student to do: see what biases are introduced into disparity metrics by using traditional PCA.
    -Dave

    ReplyDelete
  4. Dave,
    I have also posted a new function for phylogenetic PCA. I describe it here and then (with the inevitable bug fixes) here.

    ReplyDelete
  5. Hello Liam,

    Thanks for the great package (phytools) & functions! I had a couple of questions that I hope are not too naive.

    1. I have been using different methods to generate residuals from a PGLS (as you describe in your post: http://blog.phytools.org/2012/11/fitting-model-in-phylogenetic.html) I have run your code as described below:


    METHOD A
    # using phytools
    library(phytools)
    # assuming your data are in named vectors x & y
    fit.A<-phyl.resid(tree,x,y,method="lambda")
    fit.A

    METHOD B
    # using caper
    library(caper)
    # assuming your data are in named vectors x & y
    y<-y[names(x)]
    X<-data.frame(x,y,Species=names(x))
    fit.B<-pgls(y~x,comparative.data(tree,X,"Species"), lambda="ML")
    residuals(fit.B)


    and have found that "residuals(fit.A)" are the same as the values in "fit.B". However,

    residuals(fit.B, phylo=T)

    provides a different answer and is explained as the "phylogenetic residuals from the PGLS model." I had assumed (probably due to a lack of knowledge on the subject) that the results of phyl.resid would be the "phylogenetic residuals" (i.e. the residuals from a phylogenetically informed GLS model).

    I would like to use the resulting residuals of the PGLS/phyl.resid in further phylogenetically informed analyses as the size corrected value for my trait of interest (keeping in mind the advice you give in your 2009 Evolution paper). It strikes me though that the "phylogenetic residuals" are the more appropriate metric to be using if one is after a phylogenetically size corrected trait. Might you have any insight into the differences between the result of "phyl.resid" residuals and the "phylogenetic residuals" from a PGLS model?

    2. Could you say a few words on the methods you use to ensure your analyses fit the assumptions of phyl.resid (keeping in mind the cautions you give in your 2010 "Phylogenetic signal and linear regression on species data" paper and ensuring that phylogenetic analysis is appropriate in the first place).

    For example, Charlie Nunn and Natalie Cooper suggest removing species that have a studentized residual greater than 3 or less than -3 (http://nunn.rc.fas.harvard.edu/groups/pica/wiki/d8009/751_Running_PGLS_in_R_using_caper.html). Do you think this is equivalently appropriate for the studentized residuals of the phyl.resid residuals to ensure that outliers are removed? Other thoughts on techniques that must be used to assure the data meet the appropriate assumptions?

    Many thanks for any advice you might be able to provide and apologies for anything that seems naive.

    ReplyDelete
  6. First, thanks a lot for this new function and all the explanations you give on this blog.

    I'm a student in evolutionary biology and i'm studying evolution of acoustic communication in a cricket genus.
    I'm working on acoustic, morphological, and acoustic data and i would like to know if there is some corelations between those characters.
    So, i had to partial out the effect of body size on these characters prior to conduct the analysis. I obtained the residuals size corrected of all my characters.
    Now, i would like to know correlations between those variables, but i don't know if i have to do "traditional regressions" or pGLS regression on the residuals.
    I would say, i have to do pGLS regression on the residuals in order to take into account phylogenetic effect, but i'm not sure and i would like to have your point of view.

    Many thanks for any advice you might be able to provide and apologies for anything that seems naive.

    Augustin

    ReplyDelete
    Replies
    1. Dear Dr. Revell,

      I just started to work with R in phylogenetic comparative methods, and I tried your phyl.resid with the BM option to correct for the body size effect in trait correlations. With the matrix of residues after size correction, I calculated the regressions between traits (parameters of cricket songs, the same used by Augustin) using the pGLS function (pGLS packages).

      I also calculated the lambda for my traits (using the log transformed values, not the residual matrix), and 4 out of 11 traits had values of lambda not significantly different from zero. According to this information, I recalculated the residual matrix using the lambda option in phyl.resid and calculated the regressions again with pGLS. The results are quite different, and I interpreted these differences based on the fact that some traits have lambda = 0, so the correction for phylogeny was not necessary for these traits, right?

      Here is your answer for Augustin´s question:

      "Hi Augustin.

      Yes, your instinct is correct. The residuals from phyl.resid (or, alternatively, using gls in the nlme package & residuals) are still phylogenetically correlated. This is because phylogeny is used to compute the optimal regression line, but the residuals are computed in the original space (not in a phylogenetically transformed space). Thus this is not in fact a 'double correction'.

      Hope this is helpful.

      All the best, Liam"


      Well, I´m trying to finish Augustin's article and help his advisor (Dr. Tony Robillard) with these calculations. Today he told me that I should use the corrected size matrix to calculate lambda since the size can impact the results. Well, I did it (used the phyl.resid with lambda correction), and 10 of my 11 traits have lambda not significantly different from zero, meaning that they do not have phylogenetic signal... Is that correct? And more important, why the phylogenetic signal disapeared if the phyl.resid does not correct for the phylogenetic relationships? I´m really confusing with these results and need your help.


      Thanks a lot, Karla.

      Delete
  7. Dear Liam

    I am new about phylogenetical comparative methods and apologize me if this question was made previously. I am trying to do a phylogenetic size-correction (with phyl.resid), then with the residuals obtained perform a phylogenetically PCA (with phyl.pca), and finally with the scores obtained make a manova ( with aov.phylo). I made this with the averages of morphological traits, but I am interested in take account intraspecific variation within a species.
    As a test I used individual measures and a tree with members of the same species as a polytomy (e.g. " ((bufonius1,bufonius2,bufonius3,bufonius4,bufonius5,bufonius6,bufonius7,bufonius8,bufonius9,bufonius10,(mystacinus1,mystacinus2,mystacinus3,mystacinus4,mystacinus5,mystacinus6,mystacinus7,mystacinus8,mystacinus9,mystacinus10,troglodytes1,troglodytes2,troglodytes3,troglodytes4,troglodytes5,troglodytes6,troglodytes7,troglodytes8,troglodytes9,troglodytes10))" with branch lengths=1 and grafen's branch length ( I don't have branch lengthfor my tree), apparently it works but I dont know if this is a correct form to obtain the residuals, scores etc.
    thanks a lot in advance
    best regards
    Regina

    ReplyDelete
    Replies
    1. I am also interested in this question. I think most studies ignore intraspecific error and just use the average.
      This study may be relevant:
      Ives, A. R., Midford, P. E., & Garland, T. (2007). Within-species variation and measurement error in phylogenetic comparative methods. Systematic Biology, 56(2), 252-270.

      Delete
  8. Dear Liam,

    I'm wondering whether it is acceptable to log-transform your variables before using this correction?

    Thanks

    ReplyDelete