Phylogenetic Tools for Comparative Biology: New, totally rewritten version of make.simmap; new phytools build

Sunday, March 17, 2013

New, totally rewritten version of make.simmap; new phytools build

I just posted a totally rewritten version of the phytools function make.simmap for stochastic character mapping on trees. Rich Glor reported that there might be a bug in older versions of make.simmap because the posterior probabilities at internal nodes were sometimes different than the posterior probabilities computed using Jonathan Bollback's stand-alone Mac OS X program SIMMAP. That make.simmap and SIMMAP might give different results was not super concerning to me, because they do slightly different things. Specifically, 1) make.simmap is meant to sample the root state from the conditional distribution at the root, rather from the conditional distribution × the stationary distribution - as in equation (2) of Bollback (2006); and 2) make.simmap is meant to condition on the MLE of Q, the instantaneous transition matrix, rather than sampling transition rates among states from a user-specified prior distribution. [I believe I'm right about these two aspects of SIMMAP; please correct me if I have any details wrong.]

Anyway, Rich's email inspired me to take a close look at the code, which I'd written originally in mid-2011 and had been little modified since. I did find one bug - specifically, in computing probabilities to sample states from the multinomial, I had rescaled (incorrectly) by the maximum, rather than the sum, of the product of the conditional likelihoods × the transition probabilities - equation (3) in Bollback (2006). I.e.:

node.states[j,2]<-rstate(p/max(p))

in older versions of make.simmap (e.g., here), should have been:

node.states[j,2]<-rstate(p/sum(p))

This would not have mattered at all if I'd been smart and used the base function rmultinom, which automatically rescales probabilities internally to sum to 1.0 - but instead I used a custom phytools function rstate which (aside from also being computationally less efficient), did not. (rstate still exists, but now it calls rmultinom internally.)

The significance of this bug will depend on the specific case. In some empirical datasets that I ran, the effect on the posterior density (computed using densityMap) was small - but this should probably not be assumed.

In going closely over the code of make.simmap I also realized it was due a major overhaul. This is hardly surprising as I wrote the original version of this function "way back" in mid-2011 when I was only really starting to become adventurous in R after switching over from doing mostly everything in C. In re-writing make.simmap I've 'encapsulated' (so to speak) all the different steps of the mapping process into different internally called functions. I also removed the dependency of make.simmap on the 'ape' function ace for computing the conditional likelihoods - although in this case by borrowing liberally from the ace code (with attribution, of course) in a different internally called function, apeAce.

I also took advantage of the re-write to modify deviation 1) from Bollback's SIMMAP. Now the user has the option of assuming equal state frequencies, inputting arbitrary state frequencies, or estimating frequencies by numerically solving for the stationary distribution conditioned on Q. By multiplying our conditional likelihoods at the root by the stationary distribution, we are basically putting a prior on the root state that is equal to the stationary distribution implied by the fitted model. I'm not totally sure whether we want to do this or not. Feedback welcome.

One interesting case of user-defined state frequencies is the practice of setting the frequency at 1.0 for one state and 0.0 for the others. What this will result in is a set of stochastic maps conditioning on a particular value (the one with probability 1.0) at the root node of the tree - which might be interesting for some problems.

For make.simmap(...,message=TRUE) (the default), the function will now also spit back the value for the transition matrix, Q and the prior on the state frequencies at the root (π) used during simulation.

Finally, this updated version of make.simmap runs about 30-40% faster than the prior version (even with the specific bug-fix identified above), which is not trivial.

I also built a new version of phytools, phytools 0.2-27, which can be downloaded and installed from source in the usual way.

Here's a quick demo of use of make.simmap. [Note that the input matrix Q for simulation in sim.history is the transpose of the printed matrix Q in make.simmap. In the literature on molecular evolution models and Markov chains in general there is some discrepancy about whether the rows or columns should sum to 0.0 by convention (the only difference being whether we need to post- or pre-multiply e^Qt, respectively, by π^(t) to get π^(t+1)).]

> require(phytools)
> packageVersion("phytools")
[1] ‘0.2.27’
> tree<-pbtree(n=200,scale=1)
> Q<-matrix(c(-1,1,1,-1),2,2)
> rownames(Q)<-colnames(Q)<-letters[1:nrow(Q)]
> tree<-sim.history(tree,Q,anc="a")
> mtrees<-make.simmap(tree,tree$states,model="ER", nsim=100,pi="estimated")
make.simmap is sampling character histories conditioned on the transition matrix
Q =
a b
a -0.7876848 0.7876848
b 0.7876848 -0.7876848
(estimated using likelihood);
and state frequencies (used to sample the root node following Bollback 2006)
pi =
a b
0.5 0.5
> cols<-setNames(c("blue","red"),letters[1:nrow(Q)])
> plotSimmap(mtrees[[1]],cols,pts=F,ftype="off")

Now let's compute the state frequencies from the stochastic maps for each internal node. These are our posterior probabilities for nodes:

> # function to compute the states
> foo<-function(x){
+ y<-sapply(x$maps,function(x) names(x)[1])
+ names(y)<-x$edge[,1]
+ y<-y[as.character(length(x$tip)+1:x$Nnode)]
+ return(y)
+ }
>
> AA<-sapply(mtrees,foo)
> piesA<-t(apply(AA,1,function(x,levels,Nsim) summary(factor(x,levels))/Nsim,levels=letters[1:3], Nsim=100))
> plot(tree,no.margin=TRUE,show.tip.label=F)
> nodelabels(pie=piesA,cex=0.6,piecol=cols)

Finally, let's compare to the generating map:

> # first compute the number of changes of each type
> XX<-countSimmap(mtrees,message=FALSE)
> colMeans(XX)
N a,a a,b b,a b,b
28.65 0.00 10.88 17.77 0.00
> # compare to the true history
> countSimmap(tree,message=FALSE)
$N
[1] 27
$Tr
a b
a 0 11
b 16 0
> # now compute the mean overlap with the true history
> xx<-sapply(mtrees,map.overlap,tree)
> mean(xx)
[1] 0.9144295

I think that we can be pretty confident that make.simmap is doing what it's supposed to do; however please keep in mind that this version is brand new and has not been thoroughly tested. And, as Rich showed in pointing out this potential error to me, it is always wise to look behind the curtain - if possible - and don't hesitate to report any bugs, issues, or questionable results to me ASAP. I'll do whatever I can to address them.

25 comments:

Liam RevellMarch 17, 2013 at 5:58 PM
If anyone is interested in comparing make.simmap to SIMMAP, they should check out the phytools function export.as.xml which will create a SIMMAP input file.

In my very non-scientific comparison yesterday I found that SIMMAP, under the default conditions for morphology, sampled the same posterior probabilities at nodes as make.simmap - but that the number of changes in SIMMAP were too large (i.e., the 95% distribution did not include the generating number of changes). This might be just a matter of tuning the priors, or it might reflect something deeper. At this point, I just don't know.
ReplyDelete
Replies
ElisaSeptember 13, 2013 at 12:38 PM
Hy Liam,

thank you for making stochastic mapping available for R users.
Buy unfortunatelly I'm not getting it. I'll obtain the Q matrix as the output of make.simmap (and to use it in sim.history I have to transpose of the printed matrix Q of make.simmap), right? But when I try to run make.simmap I get the following message: "subscript out of bounds". I don't know what I'm doing wrong. The tree is in phylo format and I'm using SYM model. Can you please give me and example of the vector X ("a vector containing the tip states for a discretely valued character") so I can compare with mine? It is a simple vector like x=c("1","1","0","1")? One more question, how can I indicate missing information on the tip states?
ReplyDelete
Replies
UnknownMarch 21, 2014 at 9:04 AM
Hello, hope this question goes in the right place ;) Am I right that from describe.simmap I can get the transition rates simply by dividing the numbers of different transitions by the total number of all transitions? Sorry if this is silly but I'm still relatively new to comparative stuff and wan't just find similarities between phytools output and BayesTraits which I used so far.
ReplyDelete
Replies
UnknownFebruary 2, 2016 at 4:39 PM
Is there a way to fix specific transition rates in Q while allowing all others to be estimated at their most likely values?
ReplyDelete
Replies
Bernardo SantosJuly 10, 2016 at 11:51 AM
This comment has been removed by the author.
ReplyDelete
Replies
Bernardo SantosJuly 10, 2016 at 11:52 AM
Hi Liam,

Thanks for the detailed explanation. In terms of counting how many times a given transformation (e.g. 0->1) has occured, I was wondering if there is a way to generate a histogram if probability distributions for the number of changes based on the joint reconstruction.

I have used countSimmap, which returns an integer value for the number of changes, but I'm not sure how this number is calculated from the probability distributions for each node. You mention the possibility of generating histograms in your "Lecture_5.1" on ancestral state reconstructions, but I can't figure out a way of generating such an output.

Thanks a lot in advance!
ReplyDelete
Replies
AndrésMarch 24, 2017 at 1:48 PM

Hi Liam,

I am working on an example to give to a class, I am using this format for trait data (as I previously used)

Species 1
Species 2

and reading it through

> dietario <- read.table("fruti.txt", row.names = 1)

Then I converted this with the following

> XXX <- dietario[,1]
> names(XXX) <- row.names(dietario)
> XXX -> dietas

Then I got this error

> make.simmap(arbol, dietas)
Error in xx[tree$tip.label, ] : subscript out of bounds

The "arbol" and data had the same taxa

Thanks for your help!
ReplyDelete
Replies
UnknownMay 10, 2019 at 2:21 PM
Hello, I am having the same error, but I am pretty sure it is not due to incorrect matching of tip labels and species labels in my data, because i have removed all species in the data that do not match the tree$tip.labels. the error I receive is
"Error in xx[tree$tip.label, ] : subscript out of bounds"

I am happy to provide more information if needed.
ReplyDelete
Replies
Jamie ThompsonJuly 10, 2019 at 3:56 PM
Hi, how would I go about returning the node (or even better, the distance from the root) at which a trait is acquired? I.e. the first occurrence.
ReplyDelete
Replies
UnknownAugust 12, 2019 at 1:11 PM
Hello,

I have the same error as many of you already had with make.simmap:

Error in xx[tree$tip.label, ] : subscript out of bounds

But I can't figure it out, even with all previous posts.

My tree is of class "phylo" and my character has fivedifferent state ("G","g","f","d","s") and is stored in different object (I tried vector, data frame and matric), but not in my phylo object as in anoletree. My rownames of my object containing my character states correspond to those of my tip labels.

Does someone could help me ? I might missing a step here.

PS: I noticed that anoletree is both phylo and simmap class. Is it normal?
ReplyDelete
Replies
Pooja C NathanApril 10, 2021 at 8:20 AM
Hi everyone, I had the same issue as many of the folks above. I used commands as follows:

data <- data.frame(Species, character, row.names = 1)
dat1 <- setNames(data[,1], rownames(data))
sim <- make.simmap(tree, dat1, model = "ARD")

This worked for me and I was able to obtain the simmap. Hope this is useful!
ReplyDelete
Replies
BelindaJanuary 26, 2022 at 6:47 AM
Hi everyone,

I encounter an error when using the "summary" function for the simmaps I created for a set of trees (100 trees). "make.simmap" works without problems but as soon as I want to obtain the "summary" of my maps and map it onto my best tree (highest likelihood in RAXML) I receive the following error:
error in YY[jj, i] : invalid indextyp 'list'. However, I checked the output of "make.simmap", which is class 'multiSimmap' and not a list. I also checked that the labels in my tree set and the best tree match and they do...not sure what might be the problem?! Anyone having similar issues?

Here the code I used:
mtrees_seed<-make.simmap(trees,seed,model="ARD", nsim=1)
obj_seed<-summary(mtrees_seed,ref.tree=best)

Any help would be greatly appreciated! Thanks a lot!

All the best,
Belinda
ReplyDelete
Replies
BelindaJanuary 27, 2022 at 3:54 AM
Dear Liam,

thanks for the quick response. Yes, I am happy to share my data with you. However, I did not receive an email from you...I will send you the data & script to your university email address. Thanks a lot again for your help.

Best wishes,
Belinda
ReplyDelete
Replies

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.

Pages

Sunday, March 17, 2013

New, totally rewritten version of make.simmap; new phytools build

25 comments: