Wednesday, May 20, 2015

Computing map.overlap for a set of trees

A R-sig-phylo participant recently asked how one goes about using the phytools function map.overlap to compute the set of similarities between stochastic mappings for a set of trees. map.overlap calculates the summed fraction edge lengths between two trees that are the same. She also wanted to do this if the topologies of different trees from stochastic mapping differed, so long as the 1st tree in set one, and the 1st tree in set two matched, along with the 2nd with the 2nd, and so on.

Well, this is not too challenging. If we have, say, N stochastic mappings on a tree for one character, and N stochastic mappings for a second, and so long as the two characters are coded in the same way (for instance, 0 / 1, A / B, 1 / 2 / 3, etc.), then it is pretty easy to compute the N × N matrix of map overlaps between the two sets:

## first let's simulate a tree & two a/b characters to use in 
## this demo
library(phytools)
tree<-pbtree(n=100,scale=1)
Q<-matrix(c(-1,1,1,-1),2,2)
rownames(Q)<-colnames(Q)<-letters[1:2]
x<-sim.history(tree,Q)$states
## Done simulation(s).
x ## character 1
##   t4  t65  t87  t88  t55  t56  t33  t45  t83  t84  t37  t43  t44  t41  t42 
##  "b"  "a"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b" 
##  t48  t93  t94  t81  t82   t1  t64  t91  t92  t66  t15  t16  t29  t30  t26 
##  "b"  "b"  "b"  "a"  "a"  "b"  "b"  "a"  "a"  "b"  "a"  "a"  "a"  "b"  "a" 
##  t79  t80  t21   t5   t6  t34  t35  t10  t11  t95  t96  t19  t28  t31  t32 
##  "a"  "a"  "b"  "b"  "a"  "a"  "a"  "b"  "a"  "b"  "b"  "b"  "b"  "b"  "b" 
##  t23  t24  t20  t75  t76  t36   t9  t69  t70  t73  t74  t46  t27  t77  t78 
##  "b"  "a"  "a"  "a"  "a"  "a"  "b"  "b"  "b"  "a"  "a"  "b"  "b"  "b"  "b" 
##  t71  t72  t39  t40  t99 t100   t7  t25  t85  t86  t57  t58  t22  t17  t97 
##  "b"  "b"  "a"  "a"  "a"  "a"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "a"  "a" 
##  t98  t50  t49   t3  t18  t89  t90  t38  t14   t2  t47  t62  t63  t59  t51 
##  "a"  "a"  "b"  "a"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "a" 
##  t52  t60  t61  t54   t8  t12  t13  t53  t67  t68 
##  "a"  "a"  "a"  "b"  "b"  "b"  "b"  "b"  "b"  "b"
y<-sim.history(tree,Q)$states
## Done simulation(s).
y ## character 2
##   t4  t65  t87  t88  t55  t56  t33  t45  t83  t84  t37  t43  t44  t41  t42 
##  "a"  "b"  "b"  "b"  "b"  "b"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b" 
##  t48  t93  t94  t81  t82   t1  t64  t91  t92  t66  t15  t16  t29  t30  t26 
##  "b"  "b"  "b"  "a"  "a"  "b"  "a"  "a"  "a"  "a"  "b"  "a"  "a"  "a"  "b" 
##  t79  t80  t21   t5   t6  t34  t35  t10  t11  t95  t96  t19  t28  t31  t32 
##  "b"  "b"  "b"  "b"  "b"  "a"  "a"  "b"  "a"  "b"  "b"  "b"  "b"  "b"  "b" 
##  t23  t24  t20  t75  t76  t36   t9  t69  t70  t73  t74  t46  t27  t77  t78 
##  "b"  "b"  "b"  "a"  "a"  "a"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b"  "b" 
##  t71  t72  t39  t40  t99 t100   t7  t25  t85  t86  t57  t58  t22  t17  t97 
##  "b"  "b"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "a"  "b" 
##  t98  t50  t49   t3  t18  t89  t90  t38  t14   t2  t47  t62  t63  t59  t51 
##  "b"  "b"  "b"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a"  "a" 
##  t52  t60  t61  t54   t8  t12  t13  t53  t67  t68 
##  "a"  "a"  "a"  "a"  "a"  "a"  "b"  "b"  "b"  "b"
## next we can generate a set of 100 stochastic maps for each
## character
mx<-make.simmap(tree,x,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
##           a         b
## a -1.070102  1.070102
## b  1.070102 -1.070102
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##   a   b 
## 0.5 0.5
## Done.
my<-make.simmap(tree,y,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
##            a          b
## a -0.7767663  0.7767663
## b  0.7767663 -0.7767663
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##   a   b 
## 0.5 0.5
## Done.
## compute the matrix of overlaps
overlap<-sapply(mx,function(x,y) sapply(y,map.overlap,x),y=my)
dim(overlap)
## [1] 100 100
mean(overlap)
## [1] 0.495982
hist(overlap,xlim=c(0,1),col="grey")
lines(c(mean(overlap),mean(overlap)),c(0,par()$usr[4]),lty="dashed",
    col="red",lwd=2)

plot of chunk unnamed-chunk-1

Note that this matrix is assymetric as overlap[i,j] gives the map similarity between mx[[i]] and my[[j]], in this case.

Alternatively, if we want to compute the vector of pairwise comparison, 1 vs. 1, 2 vs. 2, 3 vs. 3, and so on, we can do that using mapply:

overlap<-mapply(map.overlap,mx,my)
length(overlap)
## [1] 100
mean(overlap)
## [1] 0.4902131

Now, a separate question might be what the stochastic mapping similarities of two different characters might tell us about a correlation, or lack thereof, in their evolution process. To examine this, I thought I'd use some code that I posted earlier to simulate correlated evolution of two binary traits:

## values of Q to on average produce the same rates
## in x & y as before
Q<-matrix(c(0,0.8,0.8,0,3.2,0,0,3.2,3.2,0,0,3.2,0,0.8,0.8,0),4,4,byrow=TRUE)
rownames(Q)<-colnames(Q)<-c("aa","ab","ba","bb")
diag(Q)<--rowSums(Q)
tt<-sim.history(tree,t(Q))
## Note - the rate of substitution from i->j should be given by Q[j,i].
## Done simulation(s).
tt1<-mergeMappedStates(tt,c("aa","ab"),"a")
tt1<-mergeMappedStates(tt1,c("ba","bb"),"b")
tt2<-mergeMappedStates(tt,c("aa","ba"),"a")
tt2<-mergeMappedStates(tt2,c("ab","bb"),"b")
## these data are correlated, see:
par(mfrow=c(1,2))
plotSimmap(tt1,setNames(c("blue","red"),letters[1:2]),ftype="off",lwd=1)
plotSimmap(tt2,setNames(c("blue","red"),letters[1:2]),ftype="off",lwd=1,
    direction="leftwards")

plot of chunk unnamed-chunk-3

## pull the tip states out of our simulated histories
x<-getStates(tt1,"tips")
y<-getStates(tt2,"tips")

Now let's do stochastic mapping for both characters and then compute the mean & distribution of overlap in the stochastic. Obviously, as the characters are correlated, we might expect that they show a higher degree of overlap in the stochastic maps.

mx<-make.simmap(tree,x,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
##            a          b
## a -0.8040389  0.8040389
## b  0.8040389 -0.8040389
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##   a   b 
## 0.5 0.5
## Done.
my<-make.simmap(tree,y,nsim=100)
## make.simmap is sampling character histories conditioned on the transition matrix
## Q =
##            a          b
## a -0.9141153  0.9141153
## b  0.9141153 -0.9141153
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
##   a   b 
## 0.5 0.5
## Done.
overlap<-sapply(mx,function(x,y) sapply(y,map.overlap,x),y=my)
mean(overlap)
## [1] 0.7793841
hist(overlap,xlim=c(0,1),col="grey")
lines(c(mean(overlap),mean(overlap)),c(0,par()$usr[4]),lty="dashed",
    col="red",lwd=2)

plot of chunk unnamed-chunk-4

Interesting.

Now, does this tell us anything that Pagel's (1994) method (implemented now in the phytools function fitPagel wouldn't have - I don't know, but it is kind of neat to see in any case.

That's all for now.

Tuesday, May 19, 2015

Bug fix for phenogram when ftype="off"

In developing an exercise recently, I discovered a small bug that was introduced into the function phenogram when I changed (in the latest phytools) version the optional argument spread.labels to default from FALSE to TRUE. This bug causes the function to crash if labels are turned off using ftype="off" under the default conditions.

Here's how it manifests using simulated tree & data:

library(phytools)
packageVersion("phytools")
## [1] '0.4.56'
tree<-pbtree(n=26,tip.label=LETTERS)
x<-fastBM(tree)
phenogram(tree,x) ## works

plot of chunk unnamed-chunk-1

phenogram(tree,x,ftype="off") ## doesn't work
## Error in optim(zz, ff, yy = yy, mo = mo, ms = ms, cost = cost, method = "L-BFGS-B", : L-BFGS-B needs finite values of 'fn'

plot of chunk unnamed-chunk-1

phenogram(tree,x,ftype="off",spread.labels=FALSE) ## works

plot of chunk unnamed-chunk-1

The fix is pretty easy - I just check if ftype="off" and set spread.labels=FALSE if it is. I will put this in the next version of phytools. Here's how it works:

source("http://www.phytools.org/phenogram/v1.4/phenogram.R")
phenogram(tree,x)

plot of chunk unnamed-chunk-2

phenogram(tree,x,ftype="off") ## now works

plot of chunk unnamed-chunk-2

OK, that's it for now.

Thursday, May 14, 2015

About how ace(...,marginal=TRUE) does not compute marginal ancestral states [but ace(...,marginal=FALSE) does]

Just a quick note on this issue. The function ace for type="discrete" previously suffered from the issue that it did not (as many people nonetheless at the time believed) compute marginal ancestral states - but instead conditional probabilities based only on the information contained in the subtree descended from each node.

Thankfully, this has now been fixed - but the documentation of the function ace (as well as the unfortunate choice of argument names) still makes things much more confusing, in my opinion, then they need to be.

To summarize the main point of this post: ace(...,marginal=TRUE) does not compute marginal ancestral states - rather ace(...,marginal=FALSE) (the default) does.

In the documentation of ace, the argument marginal is defined in the following way:

marginal   a logical (relevant if type = "d"). By default, the joint reconstruction of the ancestral states are done. Set this option to TRUE if you want the marginal reconstruction (see details.)”

In details we find some more information:

“If marginal = TRUE, a marginal estimation procedure is used (this was the only choice until ape 3.1-1). With this second method, the likelihood values at a given node are computed using only the information from the tips (and branches) descending from this node. With the joint estimation, all information is used for each node.”

This gives us a hint that marginal=TRUE may not in fact be returning the marginal reconstruction - because the marginal ancestral states do not use information solely from the subtree descended from the node, rather marginal reconstruction asks (node by node) what is the most likely state for this node, integrating over all the possible states, over all the other nodes in the tree, in proportion to their probability. This differs from joint reconstruction not in using only a subset of the data (as implied) but in that joint reconstruction asks instead what is the set of states at all nodes with the highest likelihood. This set of states may or may not be the set of states obtained using marginal reconstruction (or, similarly, the set of states that individually have the highest scaled likelihoods / empirical Bayes posterior proabilities).

That ace(...,marginal=TRUE) does not give the marginal reconstruction (and, conversely, that ace(...,marginal=FALSE), the default, does give the marginal reconstructions) can be shown by comparison to other functions that do compute these reconstructions as follows:

## first simulate data & tree
tree<-pbtree(n=26,tip.label=LETTERS,scale=1)
Q<-matrix(c(-1,1,0,1,-2,1,0,1,-1),3,3)
rownames(Q)<-colnames(Q)<-letters[1:3]
tree<-sim.history(tree,Q)
## Done simulation(s).
x<-tree$states
x
##   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R 
## "b" "a" "c" "b" "c" "c" "b" "b" "a" "a" "b" "a" "b" "b" "b" "c" "c" "b" 
##   S   T   U   V   W   X   Y   Z 
## "a" "b" "a" "a" "a" "a" "a" "a"

Start with ace for marginal=TRUE & marginal=FALSE:

fit.ace.true<-ace(x,tree,type="discrete",model="SYM",marginal=TRUE)
fit.ace.false<-ace(x,tree,type="discrete",model="SYM",marginal=FALSE)
plotTree(tree,mar=c(0.1,0.1,5.1,0.1),offset=0.5)
nodelabels(pie=fit.ace.true$lik.anc,cex=0.7)
tiplabels(pie=to.matrix(x,seq=letters[1:3]),cex=0.5)
title(main="ace(...,marginal=TRUE)")

plot of chunk unnamed-chunk-2

plotTree(tree,mar=c(0.1,0.1,5.1,0.1),offset=0.5)
nodelabels(pie=fit.ace.false$lik.anc,cex=0.7)
tiplabels(pie=to.matrix(x,seq=letters[1:3]),cex=0.5)
title(main="ace(...,marginal=FALSE)")

plot of chunk unnamed-chunk-2

plot(fit.ace.true$lik.anc,fit.ace.false$lik.anc,
    xlab="ace(...,marginal=TRUE)",ylab="ace(...,marginal=FALSE)")

plot of chunk unnamed-chunk-3

Now we can compare to marginal ancestral state reconstruction implemented in other functions. In phytools, I have the function rerootingMethod (details here) that takes advantage of the re-rooting algorithm of Yang et al.:

fit.rrm<-rerootingMethod(tree,x,model="SYM")
plotTree(tree,mar=c(0.1,0.1,5.1,0.1),offset=0.5)
nodelabels(pie=fit.rrm$marginal.anc,cex=0.7)
tiplabels(pie=to.matrix(x,seq=letters[1:3]),cex=0.5)
title(main="rerootingMethod")

plot of chunk unnamed-chunk-4

We can compare this to ace(...,marginal=FALSE):

plot(fit.ace.false$lik.anc,fit.rrm$marginal.anc,
    xlab="ace(...,marginal=FALSE)",ylab="rerootingMethod")

plot of chunk unnamed-chunk-5

Finally, we can use Klaus Schliep's phangorn package as follows (code snippet provided here by Klaus):

library(phangorn)
X<-phyDat(as.matrix(x),type="USER",levels=c("a","b","c"))
fit<-pml(tree,X)
fit<-optim.pml(fit,optEdge=FALSE,optRate=TRUE,optQ=TRUE)
## optimize rate matrix:  -30.36139 --> -27.84751 
## optimize rate:  -27.84751 --> -26.61064 
## optimize rate matrix:  -26.61064 --> -26.40027 
## optimize rate:  -26.40027 --> -26.32162 
## optimize rate matrix:  -26.32162 --> -26.27694 
## optimize rate:  -26.27694 --> -26.2545 
## optimize rate matrix:  -26.2545 --> -26.24114 
## optimize rate:  -26.24114 --> -26.23345 
## optimize rate matrix:  -26.23345 --> -26.2286 
## optimize rate:  -26.2286 --> -26.22557 
## optimize rate matrix:  -26.22557 --> -26.22358 
## optimize rate:  -26.22358 --> -26.22228 
## optimize rate matrix:  -26.22228 --> -26.22139 
## optimize rate:  -26.22139 --> -26.22079 
## optimize rate matrix:  -26.22079 --> -26.22038 
## optimize rate:  -26.22038 --> -26.22009 
## optimize rate matrix:  -26.22009 --> -26.21989 
## optimize rate:  -26.21989 --> -26.21975 
## optimize rate matrix:  -26.21975 --> -26.21965 
## optimize rate:  -26.21965 --> -26.21957
fit.phangorn<-ancestral.pml(fit)
anc.ml<-t(sapply(1:tree$Nnode+Ntip(tree),function(i,x) x[[as.character(i)]][1,],x=fit.phangorn))
plotTree(tree,mar=c(0.1,0.1,5.1,0.1),offset=0.5)
nodelabels(pie=anc.ml,cex=0.7)
tiplabels(pie=to.matrix(x,seq=letters[1:3]),cex=0.5)
title(main="phangorn::ancestral.pml")

plot of chunk unnamed-chunk-6

plot(fit.ace.false$lik.anc,anc.ml,xlab="ace(...,marginal=FALSE)",
    ylab="phangorn::ancestral.pml")

plot of chunk unnamed-chunk-7

OK, well I guess that the point of this lengthy exercise is not to criticize ace, or ape (which is an incredible, impressive, multifunctional package at the core of most phylogenetics in R), but merely to point out that ace may not be doing exactly what you think it's doing - when you'd think it would be reasonable for it to be doing that thing. The good news is that with the default settings in recent versions of ape the reconstructed ancestral states returned are marginal ancestral states (aka. empirical Bayesian posterior probabilities), which is probably what we should be using for ancestral state reconstruction of discrete traits on the tree.