Tuesday, May 13, 2014

contMap with missing data

Today a phytools user emailed with the following question:

"I am still having difficulty running an analysis on the evolution of a continuous character with a partial data set (I have character states for about 50% of the taxa). I would like to display the results using contMap, but since fastAnc is built into the code, it won't map the character because the # of taxa in the data set doesn't match the # of taxa in the tree. I attempted to replace fastAnc with the modified version of anc.ML from one of your blog posts (link), but I couldn't get that to work either. I received the same error message as with fastAnc:
Error in ace(x, a, method = "pic") : length of phenotypic and of phylogenetic data do not match
Is there any way to use
contMap with missing data? Any help would be greatly appreciated. Thank you for your time."

They are totally on the right track, but things get a little trickier than they might have anticipated because whereas fastAnc (when used in it's default mode) just returns a vector of estimated character values at nodes, anc.ML (when some taxa are present in the tree but missing from the input data) returns a list with the vector missing.x containing the MLE phenotypic trait values of taxa with missing data. It also happens to be the case that I never got around to adding this updated version of anc.ML to the phytools package.

Well, I have now made both updates so it is possible to (1) specify an inference method (method="fastAnc" or method="anc.ML"); and (2) phytools function anc.ML now permits missing data, in which case the states for the tip taxa in the tree missing from the data vector will be estimated using ML. The code for these updates are here: anc.ML, contMap, but since both use internal functions not in the phytools namespace it is probably wise to download the latest phytools build and install from source.

Here's a quick demo:

> packageVersion("phytools")
[1] ‘0.4.7’
> ## simulate tree & data
> tree<-pbtree(n=26,tip.label=LETTERS)
> x<-fastBM(tree)
> ## compute limits on x to use in both plots
> lims<-c(floor(min(x)*10)/10,ceiling(max(x)*10)/10)
> ## simulate missing tip values
> y<-sample(x,20)
> ## which taxa are missing
> setdiff(tree$tip.label,names(y))
[1] "H" "I" "O" "T" "W" "Z"
> ## contMap from full dataset
> contMap(tree,x,lims=lims,legend=1.5)
> ## contMap from reduced dataset
> contMap(tree,y,lims=lims,method="anc.ML",legend=1.5)

OK - that's it.

14 comments:

  1. Do you think it might be worthwhile to annotate the tips with missing data so which are missing is immediately apparent? Maybe give them a white bubble or something at the very tip?

    ReplyDelete
    Replies
    1. This can be done easily already by the user because contMap works with ape tiplabels and nodelabels. Good suggestion.

      Delete
  2. Amazing work! Beautiful.

    A general question:
    Would it be possible in contMap to specify our own colorramp instead of stuck with the default red-to-blue?

    ReplyDelete
    Replies
    1. This is not too hard. I just blogged about it here. Let me know if you have any difficulties. - Liam

      Delete
  3. Hello Dr. Revell,

    I am very new to R and have been using it for a few weeks. I have a dataset of continuous characters which has missing data. When I used the example above, I could not load the data file(which in .csv format) onto the tree. and also when i ran the code, I got an error which said Error in ace(x, a, method = "pic") :
    length of phenotypic and of phylogenetic data do not match.
    In addition: Warning message:
    In mean.default(x) : argument is not numeric or logical: returning NA.

    Also when I use the setdiff function, the missing value taxa are not shown correctly, i.e the ones which have missing values arent shown, instead its random all the time.
    Here is the code I ran:
    library(ape)
    library(phytools)
    tree<- read.nexus("data1.nexus")
    data<- read.csv("lhpfam.csv")
    a<-extract.clade(tree, node=91)
    plot(a)
    x <- fastBM(a)
    col2=2
    y<- data
    contMap(a,y,method="anc.ML",legend=0.5)

    Here is the dataset : https://onedrive.live.com/redir?resid=6d3c647fc5b8119c%21138

    ReplyDelete
  4. (Sorry if this is a duplicate; my other comment didn't go through)

    I have just started using contMap too! I just got it to work for missing data and here were the pitfalls that I fell into...

    (1) Continuous character data should be formatted as row #1 as species names, and row #2 as continuous character data. I think you might need to transpose your data matrix.

    (2) If you read in your data file as read.csv, it's read in as a data frame and that throws an error "Error in A[i, 1]/YY[j] : non-numeric argument to binary operator" So I used the unlist() function to convert it into a vector... which seemed to make the function happy!

    Overall, here is my code:

    tree <- read.tree("Bayes_rooted.tre")
    protodata <- read.csv("protodata.csv", header=TRUE)
    protovector <- unlist(protodata)
    lims<-c(floor(min(protovector)*10)/10,ceiling(max(protovector)*10)/10)
    contMap(tree, protovector, lims=lims, legend=1.5, method="anc.ML")

    Hopefully that helps and isn't secretly misleading! (Please correct me if I'm wrong!) Thanks Dr. Revell for writing such an awesome function!

    ReplyDelete
    Replies
    1. Hello Stephanie,

      I did try the code you've posted after transposing my data, but I am still getting the same error .

      Delete
    2. Hi Liam,

      I also have a tree with more taxa than the data of the continuous variable I want to reconstruct. I was glad to see that there is a possibility to do it using method="anc.ML" in contMap but I couldn't and I got the following error:

      Error in ace(x, a, method = "pic") :
      NA/NaN/Inf in foreign function call (arg 5)

      I really don't understand what's going on. Could you guide me? thank you.

      Delete
  5. Hi Liam,

    I also have a tree with more taxa than the data of the continuous variable I want to reconstruct. I was glad to see that there is a possibility to do it using method="anc.ML" in contMap but I couldn't and I got the following error:

    Error in ace(x, a, method = "pic") :
    NA/NaN/Inf in foreign function call (arg 5)

    I really don't understand what's going on. Could you guide me? thank you.

    ReplyDelete
  6. Which symbol can I use for missing data in the csv matrix for discrete characters? Can I use the "?" symbol?

    ReplyDelete
  7. Dear all, I am new to Phytools and R, and I am trying to reconstruct continuous ancestral states.

    I could import the tree in Newick format (with branch lenghts, not ultrametric), and imported a csv file containing trait value but with missing data.
    RPCS_copy_number<-read.csv("Males.csv",row.names=1)

    Then I converted it into a vector
    RPCS_copy_number<-as.matrix(RPCS_copy_number)[,1]

    But the problem appears when I try to use the FastAnc function:
    fit<-fastAnc(mtDNA.tree,RPCS_copy_number,vars=TRUE,CI=TRUE)

    It prompts: Error in ace(x, a, method = "pic") :
    NA/NaN/Inf in foreign function call (arg 5)

    I understand the problem of having missing data in the character matrix is solved, but it seems I am wrong. Can you help me?





    ReplyDelete
    Replies
    1. I do not know what the problem is -- but here are some things you could try:

      1) Resolve polytomies on your tree using multi2di. This shouldn't be the issue, but it is worth checking.

      2) Verify that your data are a simple numeric vector with names.

      3) Verify that your tree does not have any terminal edge lengths of zero length -- and, in particular, any pairs of taxa with different states for your trait that are separated by zero distance on the reconstructed tree.

      4) Verify that you use the argument method="anc.ML" for your contMap.

      - Liam

      Delete
  8. Dear all, I am new to Phytools and R, I also got error fastAnc:
    Error in ace(x, a, method = "pic") : length of phenotypic and of phylogenetic data do not match this. Then I use "anc.ML" method, but unfortunately I got another error: Error in solve.default(C) :
    Lapack routine dgesv: system is exactly singular: U[11,11] = 0.

    Has anyone encounter this error? What should I do to solve this problem? Thanks so much!

    ReplyDelete
    Replies
    1. My guess is that this is either because of a mismatch between the taxa in your tree and those in your character vector, or because your character vector does not have names. The easiest way to solve this is probably using the setNames function in base R.

      Delete

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.