Wednesday, October 18, 2017

plotTree.barplot in an array or with a custom layout

I just pushed a very small update to the phytools function plotTree.barplot.

The update adds the optional argument add which, when set to add=TRUE, plots the tree & barplot in the next two open plot panels - rather than opening a new split plotting device using par(mfrow=c(1,2)).

Normally, we don't want this: if no multi-panel layout has been set up in advance, it will cause our tree & barplot to be plotted sequentially (to the same device) rather than side-by-side. E.g.:

library(phytools)
x<-fastBM(tree<-pbtree(n=26,tip.label=LETTERS,scale=1),a=2)
plotTree.barplot(tree,x,add=TRUE)

plot of chunk unnamed-chunk-1
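
For comparison, here's the default behavior (add=FALSE), in which plotTree.barplot opens a new split plotting device & draws the tree & barplot side-by-side:

plotTree.barplot(tree,x) ## default: tree on the left, barplot on the right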

However, this option might be handy if, for instance, we wanted to create an array of plotTree.barplot plots - which we might do as follows:

X<-fastBM(tree,nsim=4,a=2,bounds=c(0,Inf))
par(mfrow=c(2,4))
colnames(X)<-paste("x",1:4,sep="")
nulo<-apply(X,2,plotTree.barplot,tree=tree,add=TRUE)

plot of chunk unnamed-chunk-2

Or, with the column names as axis labels:

par(mfrow=c(2,4))
nulo<-mapply(function(x,name,tree) plotTree.barplot(tree,x,add=TRUE,
    args.barplot=list(xlab=name)),lapply(1:ncol(X),function(i,x) x[,i],x=X),
    colnames(X),MoreArgs=list(tree=tree))

plot of chunk unnamed-chunk-3

Neat.

Finally, the other thing we could do with this is modify the relative amount of space we allocate to the tree & barplot, respectively. This could be particularly useful if our barplot contains a lot of information. For instance, here I'll use a 'stacked barplot' style for all four columns of X:

layout(matrix(c(1,2),1,2),widths=c(0.3,0.7))
plotTree.barplot(tree,X,add=TRUE)

plot of chunk unnamed-chunk-4

Tuesday, October 17, 2017

Phylogenetic 'Anscombe' datasets

Since phytools contains a range of different methods for visualizing trait evolution on trees, I have been thinking a bit about the value of plotting our data, rather than just fitting sophisticated models & reporting the results (as is quite common).

The importance of doing this kind of graphing in statistics is now fairly well appreciated; however, this has not always been the case. For instance, in 1973 Francis Anscombe illustrated the importance of graphing with his famous 'quartet,' published in a paper designed to counter the impression among statisticians that “numerical calculations are exact, but graphs are rough” (source).

The Anscombe quartet is a set of four bivariate datasets that are virtually identical in a range of standard descriptive statistics (means, regression coefficients, F-values, r², P-values, and so on), but that, when graphed, reveal clearly different patterns of covariation between x & y.

This is what I mean. [Note that all of the following is modified substantially - though not in substance - from the documentation of the datasets package.]

library(datasets)
attach(anscombe)
anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
fits<-list()
for(i in 1:4){
    model<-paste("y",i,"~x",i,sep="")
    cat(paste("Model",i,"\n--------",model,"--------\n"))
    fits[[i]]<-lm(model)
    cat("\n")
    print(coef(fits[[i]]))
    print(anova(fits[[i]]))
    cat("\n-----------------------\n\n")
}
## Model 1 
## -------- y1~x1 --------
## 
## (Intercept)          x1 
##   3.0000909   0.5000909 
## Analysis of Variance Table
## 
## Response: y1
##           Df Sum Sq Mean Sq F value  Pr(>F)   
## x1         1 27.510 27.5100   17.99 0.00217 **
## Residuals  9 13.763  1.5292                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -----------------------
## 
## Model 2 
## -------- y2~x2 --------
## 
## (Intercept)          x2 
##    3.000909    0.500000 
## Analysis of Variance Table
## 
## Response: y2
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x2         1 27.500 27.5000  17.966 0.002179 **
## Residuals  9 13.776  1.5307                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -----------------------
## 
## Model 3 
## -------- y3~x3 --------
## 
## (Intercept)          x3 
##   3.0024545   0.4997273 
## Analysis of Variance Table
## 
## Response: y3
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x3         1 27.470 27.4700  17.972 0.002176 **
## Residuals  9 13.756  1.5285                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -----------------------
## 
## Model 4 
## -------- y4~x4 --------
## 
## (Intercept)          x4 
##   3.0017273   0.4999091 
## Analysis of Variance Table
## 
## Response: y4
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## x4         1 27.490 27.4900  18.003 0.002165 **
## Residuals  9 13.742  1.5269                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## -----------------------
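
We can also check some simpler summaries directly - for instance, the sample means of each column & the correlation between each (x, y) pair:

colMeans(anscombe)
sapply(1:4,function(i) cor(anscombe[[i]],anscombe[[i+4]]))

Each of the four x-y correlations comes out to about 0.816.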

Now let's plot them:

par(mfrow=c(2,2),mar=c(4.6,4.1,1.1,1.1),oma=c(0,0,3,0))
for(i in 1:4){
    model<-paste("y",i,"~x",i,sep="")
    plot(as.formula(model),pch=21,
        bg=phytools::make.transparent("orange",0.5),cex=2,xlim=c(3,19),
        ylim=c(3,13))
    abline(fits[[i]],lwd=2,col=phytools::make.transparent("blue",0.25))
}
mtext("Anscombe's 4 Regression data sets",outer=TRUE,cex=1.5,line=0.5)

plot of chunk unnamed-chunk-2

So what of the relevance to phylogenetic data? Well, we could easily imagine a scenario in which very different character patterns could result in the same fitted evolutionary model. For instance, take the following tree & three continuous characters, x, y, & z:

library(phytools)
library(geiger)
tree
## 
## Phylogenetic tree with 26 tips and 25 internal nodes.
## 
## Tip labels:
##  A, B, C, D, E, F, ...
## 
## Rooted; includes branch lengths.
data.frame(x,y,z)
##          x        y        z
## A 1.833108 2.962696 2.835180
## B 1.872561 3.073251 2.898368
## C 1.936456 3.614328 2.647927
## D 2.279403 1.682714 2.535930
## E 2.333815 1.917306 3.525612
## F 2.368385 2.652735 3.112808
## G 2.624223 3.059308 2.917372
## H 2.851292 2.755371 3.364408
## I 2.931033 3.072902 3.464446
## J 3.027722 2.899132 3.366387
## K 3.035856 3.463041 3.092368
## L 3.180211 2.238550 3.301345
## M 3.238953 2.558549 3.225307
## N 3.259976 1.723714 2.641809
## O 3.262605 3.580004 2.657079
## P 3.339634 3.540232 3.227204
## Q 3.632080 3.349932 3.112958
## R 3.653154 3.283649 2.703744
## S 3.783546 3.916703 3.213564
## T 3.935576 4.046783 3.341507
## U 4.037169 3.180505 3.264513
## V 4.291665 3.100641 2.648084
## W 4.333509 3.990103 2.874010
## X 4.350045 3.241618 3.293307
## Y 4.724357 3.747172 3.090894
## Z 5.288439 3.679698 2.722883

If we fit a Brownian model using geiger's fitContinuous function we will find that all three have (virtually) identical Brownian rates, ancestral state values, and even log-likelihoods!

fitContinuous(tree,x)
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
##  sigsq = 0.499997
##  z0 = 3.000000
## 
##  model summary:
##  log-likelihood = -16.681907
##  AIC = 37.363814
##  AICc = 37.885553
##  free parameters = 2
## 
## Convergence diagnostics:
##  optimization iterations = 100
##  failed iterations = 0
##  frequency of best fit = 1.00
## 
##  object summary:
##  'lik' -- likelihood function
##  'bnd' -- bounds for likelihood search
##  'res' -- optimization iteration summary
##  'opt' -- maximum likelihood parameter estimates
fitContinuous(tree,y)
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
##  sigsq = 0.499988
##  z0 = 3.000002
## 
##  model summary:
##  log-likelihood = -16.681672
##  AIC = 37.363344
##  AICc = 37.885083
##  free parameters = 2
## 
## Convergence diagnostics:
##  optimization iterations = 100
##  failed iterations = 0
##  frequency of best fit = 1.00
## 
##  object summary:
##  'lik' -- likelihood function
##  'bnd' -- bounds for likelihood search
##  'res' -- optimization iteration summary
##  'opt' -- maximum likelihood parameter estimates
fitContinuous(tree,z)
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
##  sigsq = 0.500000
##  z0 = 3.000000
## 
##  model summary:
##  log-likelihood = -16.681969
##  AIC = 37.363938
##  AICc = 37.885677
##  free parameters = 2
## 
## Convergence diagnostics:
##  optimization iterations = 100
##  failed iterations = 0
##  frequency of best fit = 1.00
## 
##  object summary:
##  'lik' -- likelihood function
##  'bnd' -- bounds for likelihood search
##  'res' -- optimization iteration summary
##  'opt' -- maximum likelihood parameter estimates

However, if we visualize our data we shall see that they show strikingly different patterns. To do this I'll use the function phenogram from the phytools package:

par(mfrow=c(1,3),mar=c(5.1,4.1,1.1,1.1),oma=c(0,0,3,0))
labels<-setNames(seq(min(c(x,y,z)),max(c(x,y,z)),
    by=diff(range(c(x,y,z)))/(Ntip(tree)-1)),
    names(sort(x)))
phenogram(tree,x,spread.cost=c(1,0),ylim=range(c(x,y,z)),
    label.pos=labels,colors="lightblue",fsize=1.2,
    ylab="phenotype (x)")
labels<-setNames(seq(min(c(x,y,z)),max(c(x,y,z)),
    by=diff(range(c(x,y,z)))/(Ntip(tree)-1)),
    names(sort(y)))
phenogram(tree,y,spread.cost=c(1,0),ylim=range(c(x,y,z)),
    label.pos=labels,colors="lightgreen",fsize=1.2,
    ylab="phenotype (y)")
labels<-setNames(seq(min(c(x,y,z)),max(c(x,y,z)),
    by=diff(range(c(x,y,z)))/(Ntip(tree)-1)),
    names(sort(z)))
phenogram(tree,z,spread.cost=c(1,0),ylim=range(c(x,y,z)),
    label.pos=labels,colors="orange",fsize=1.2,
    ylab="phenotype (z)")
mtext("traitgrams for x, y, & z",outer=TRUE,cex=1.5,line=0.5)

plot of chunk unnamed-chunk-5

In one case (x), we have data that are highly phylogenetically structured; at the other extreme (z), our data seem to have very little phylogenetic structure at all; and in the middle (y) we see an intermediate degree of phylogenetic patterning.

We'll get the same impression if we use a contMap style plot:

par(mfrow=c(1,3),oma=c(0,0,3,0))
obj<-contMap(tree,x,plot=FALSE)
plot(obj,xlim=c(-0.1,1.2),leg.txt="x",fsize=1.2)
obj<-contMap(tree,y,plot=FALSE)
plot(obj,xlim=c(-0.1,1.2),leg.txt="y",fsize=1.2)
obj<-contMap(tree,z,plot=FALSE)
plot(obj,xlim=c(-0.1,1.2),leg.txt="z",fsize=1.2)
mtext("contMap plots for x, y, & z",outer=TRUE,cex=1.5,line=0.5)

plot of chunk unnamed-chunk-6

Note that, unlike in the previous plot, here I have not used a consistent trait scale across the three panels; this makes the lack of phylogenetic structure in z more apparent.
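
Incidentally, if we did want a common color scale, contMap has a lims argument that fixes the range of the color map - so we could, for instance, re-plot z using the full range of all three traits:

## re-plot z with a color map spanning the range of x, y, & z
obj<-contMap(tree,z,lims=range(c(x,y,z)),plot=FALSE)
plot(obj,xlim=c(-0.1,1.2),leg.txt="z",fsize=1.2)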

Of course, if we measured even a relatively simple diagnostic such as phylogenetic signal (here, Blomberg's K, the default method in phylosig) for the three datasets, we would find that they are not at all alike!

phylosig(tree,x)
## [1] 2.077251
phylosig(tree,y)
## [1] 1.013159
phylosig(tree,z)
## [1] 0.207328

You too can generate your own phylogenetic Anscombe datasets. I made mine as follows:

tree<-pbtree(n=26,scale=1,tip.label=LETTERS)
x<-setNames(sort(fastBM(tree)),tree$tip.label)
y<-fastBM(tree,a=3,sig2=0.5)
z<-setNames(sample(fastBM(tree)),sample(tree$tip.label))[tree$tip.label]

## objective: find the scale (theta[1]) & shift (theta[2]) of y for which the
## ML Brownian rate equals sig2 & the ML root state equals a
foo<-function(theta,sig2=1,a=0,y,C){
    y<-y/theta[1] ## rescale
    y<-y-theta[2] ## recenter
    N<-nrow(C)
    obj<-phyl.vcv(as.matrix(y),C,1)
    ## squared distance from the target rate (with REML -> ML correction) &
    ## from the target ancestral state
    (obj$R[1,1]*(N-1)/N-sig2)^2+(obj$alpha[1,1]-a)^2
}

fit<-optim(c(1,0),foo,sig2=0.5,a=3.0,y=x,C=vcv(tree),method="L-BFGS-B",
    lower=c(1e-12,-Inf),upper=rep(Inf,2))

x<-x/fit$par[1]-fit$par[2]

fit<-optim(c(1,0),foo,sig2=0.5,a=3.0,y=y,C=vcv(tree),method="L-BFGS-B",
    lower=c(1e-12,-Inf),upper=rep(Inf,2))

y<-y/fit$par[1]-fit$par[2]

fit<-optim(c(1,0),foo,sig2=0.5,a=3.0,y=z,C=vcv(tree),method="L-BFGS-B",
    lower=c(1e-12,-Inf),upper=rep(Inf,2))

z<-z/fit$par[1]-fit$par[2]

In the code above, I simulated data via Brownian motion, then either sorted them by the tip order (x), did nothing (y), or randomized them among tips (z), before recentering & rescaling each data vector to have the desired ML σ2 and ancestral state values.
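
As a quick sanity check, we can push one of the rescaled vectors back through phyl.vcv - just as foo does internally - & confirm that the implied ML rate & root state land (very nearly) on our targets of 0.5 & 3.0:

obj<-phyl.vcv(as.matrix(x),vcv(tree),1)
obj$R[1,1]*(Ntip(tree)-1)/Ntip(tree) ## ML sigma^2: should be ~0.5
obj$alpha[1,1] ## ML root state: should be ~3.0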

That's it!