Monday, May 23, 2016

Estimating the evolutionary correlation between a multistate discrete character and a continuous trait under the threshold model: An ad hoc approah

Several people have expressed interest in fitting a correlational threshold model in which one or more of the threshold traits are multistate rather than binary. Such a trait would have to be ordered, in which subsequent states represent subsequent changes in liability across a series of two or more thresholds.

The main difficulty that this poses is that for multistate characters we must also estimate the relative positions of the thresholds. Unfortunately this is not straightforward. One reasonable option that I suggested in my 2014 paper is that the liabilities at the tips of the tree, the correlations between traits, and the threshold positions, could be sampled from their joint posterior distribution using MCMC. Unfortunately, no such approach has yet been implemented in R or other software.

What has been developed, though, is a function (ancThresh) in my package that can sample tip and node liabilites, and the positions of two or more thresholds, for a single trait using MCMC, primarily to the end of estimating ancestral character values under the model for that trait. One ad hoc, though creative, tactic that I have seen employed recently has been to first sample tip & node liabilities, and then use these liabilites in a standard phylogenetic correlation analysis. Though this seems quite reasonable, to my knowledge no one has yet explored the statistical properties of the method. This isn't hard to do, however, so I though I'd try to do it here.

To start with, we can load some packages:

library(phytools)
library(plotrix)
library(coda)

Next, let's simulate a tree & some data. Here, I'll simulate no correlation (r=0) between the threshold trait and a second, continuous character:

## simulate tree
tree<-pbtree(n=26,scale=1,tip.label=LETTERS)
## simulated correlated data for two characters
r<-0
VCV<-matrix(c(1,r*sqrt(2),r*sqrt(2),2),2,2)
X<-sim.corrs(tree,VCV)
X[,1]<-X[,1]-mean(X[,1]) ## recenter character 1
## threshold trait 1
x<-sapply(X[,1],threshState,thresholds=setNames(c(0,1,Inf),
    letters[1:3]))
x
##   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O   P   Q   R 
## "c" "c" "c" "b" "a" "a" "a" "c" "a" "a" "a" "a" "a" "a" "a" "a" "a" "b" 
##   S   T   U   V   W   X   Y   Z 
## "b" "b" "b" "a" "a" "b" "b" "b"
y<-X[,2]
y
##           A           B           C           D           E           F 
## -1.67106581 -1.38254877 -0.98622363 -0.66341721 -0.95826839 -0.33150388 
##           G           H           I           J           K           L 
##  1.02161990  0.04758419 -0.16356017 -0.05162870  0.34089327 -0.08999551 
##           M           N           O           P           Q           R 
##  0.06010288  0.61015394  0.27364931  0.58195450  0.48776407  0.63435560 
##           S           T           U           V           W           X 
## -0.13646326  0.77393761  0.71105093  0.65765914  1.23746620  1.00154033 
##           Y           Z 
##  1.09755746  1.32505288

Now that is done, we can go ahead & sample tip & node liabilities, and the positions of the thresholds (actually, only one threshold, that between b & c, because the threshold between a and b is fixed) from their joint posterior distribution using Bayesian MCMC. This is for our single thresholded trait.

mcmc<-ancThresh(tree,x,ngen=200000,sequence=letters[1:3],
    control=list(plot=FALSE,print=FALSE))
## MCMC starting....

Now that is complete, let's first compare the mean liabilities for the thresholded character to the underlying true liabilities, merely because they happen to be known in this case:

hpd<-function(x) HPDinterval(as.mcmc(x))
colors<-setNames(c("blue","red","green"),letters[1:3])
plotCI(X[,1],colMeans(mcmc$liab[21:201,tree$tip.label]),pch=19,
    ui=apply(mcmc$liab[21:201,tree$tip.label],2,hpd)[2,],
    li=apply(mcmc$liab[21:201,tree$tip.label],2,hpd)[1,],
    xlab="true liabilities",ylab="estimated liabilities (& HPDs)",
    col=colors[x],scol="black")
lines(par()$usr[1:2],par()$usr[1:2],lty="dashed",lwd=2,col="red")
abline(v=0,lty="dotted")
abline(h=0,lty="dotted")

plot of chunk unnamed-chunk-4

Having done this we see that most of the true liability values, which we would not normally know of course, fall with the 95% HPD interval for the estimates.

Next, we can obtain a posterior distribution for the correlation between x & y. This is what we set out after in the first place, recall. Here what I'll do - instead of computing a single correlation from the posterior means - is compute a separate correlation from each post burn-in sample that I have from the posterior. Remember that I must take phylogeny into account when doing so!

foo<-function(x,y,tree)
    cov2cor(phyl.vcv(cbind(x,y),vcv(tree),1)$R)[1,2]
r.mcmc<-apply(mcmc$liab[21:201,tree$tip.label],1,foo,y=X[,2],tree=tree)
pd.r<-density(r.mcmc)
pd.r
## 
## Call:
##  density.default(x = r.mcmc)
## 
## Data: r.mcmc (181 obs.); Bandwidth 'bw' = 0.04539
## 
##        x                  y            
##  Min.   :-0.57385   Min.   :0.0005589  
##  1st Qu.:-0.31506   1st Qu.:0.0969140  
##  Median :-0.05627   Median :0.6718680  
##  Mean   :-0.05627   Mean   :0.9650849  
##  3rd Qu.: 0.20251   3rd Qu.:1.9282196  
##  Max.   : 0.46130   Max.   :2.5189292
plot(pd.r,main="posterior density on the correlation",xlab="r")
abline(v=0,lty="dashed",lwd=2,col="red")
text(x=0,y=0.1,"generating value of r",pos=4)

plot of chunk unnamed-chunk-5

ci<-HPDinterval(as.mcmc(r.mcmc))
ci
##           lower     upper
## var1 -0.3705042 0.1758467
## attr(,"Probability")
## [1] 0.9502762

Of course, we would need to duplicate this a large number of times (but due the computationally instensive nature of the MCMC, we can start with 20) to get a general sense of the overall statistical properties of this approach. Let's try it:

nrep<-20
ci<-ci.cov<-matrix(NA,nrep,2)
R<-R.mean<-Cov<-Cov.mean<-vector()
mcmc<-vector(mode="list",length=nrep)
r<-0
VCV<-matrix(c(1,r*sqrt(2),r*sqrt(2),2),2,2)
corp<-function(x,y,tree)
    cov2cor(phyl.vcv(cbind(x,y),vcv(tree),1)$R)[1,2]
covp<-function(x,y,tree)
    phyl.vcv(cbind(x,y),vcv(tree),1)$R[1,2]
trees<-pbtree(n=26,scale=1,tip.label=LETTERS,nsim=nrep)
for(i in 1:nrep){
    ## simulation
    X<-sim.corrs(trees[[i]],VCV)
    X[,1]<-X[,1]-mean(X[,1])
    x<-sapply(X[,1],threshState,thresholds=setNames(c(0,1,Inf),
        letters[1:3]))
    ## MCMC
    mcmc[[i]]<-ancThresh(trees[[i]],x,ngen=200000,sequence=letters[1:3],
        control=list(plot=FALSE,print=FALSE))
    r.mcmc<-apply(mcmc[[i]]$liab[21:201,trees[[i]]$tip.label],1,corp,
        y=X[,2],tree=trees[[i]])
    R[i]<-mean(r.mcmc)
    ci[i,]<-HPDinterval(as.mcmc(r.mcmc))
    ## calculate a couple of other things here for comparison
    R.mean[i]<-corp(colMeans(mcmc[[i]]$liab[21:201,
        trees[[i]]$tip.label]),X[,2],trees[[i]])
    cov.mcmc<-apply(mcmc[[i]]$liab[21:201,trees[[i]]$tip.label],1,covp,
        y=X[,2],tree=trees[[i]])
    Cov[i]<-mean(cov.mcmc)
    ci.cov[i,]<-HPDinterval(as.mcmc(cov.mcmc))
    Cov.mean<-covp(colMeans(mcmc[[i]]$liab[21:201,
        trees[[i]]$tip.label]),X[,2],trees[[i]])
    cat(paste("------------------------ Done replicate",i,"\n"))
}
## MCMC starting....
## ------------------------ Done replicate 1
## MCMC starting....
## ------------------------ Done replicate 2
## MCMC starting....
## ------------------------ Done replicate 3
## MCMC starting....
## ------------------------ Done replicate 4
## MCMC starting....
## ------------------------ Done replicate 5
## MCMC starting....
## ------------------------ Done replicate 6
## MCMC starting....
## ------------------------ Done replicate 7
## MCMC starting....
## ------------------------ Done replicate 8
## MCMC starting....
## ------------------------ Done replicate 9
## MCMC starting....
## ------------------------ Done replicate 10
## MCMC starting....
## ------------------------ Done replicate 11
## MCMC starting....
## ------------------------ Done replicate 12
## MCMC starting....
## ------------------------ Done replicate 13
## MCMC starting....
## ------------------------ Done replicate 14
## MCMC starting....
## ------------------------ Done replicate 15
## MCMC starting....
## ------------------------ Done replicate 16
## MCMC starting....
## ------------------------ Done replicate 17
## MCMC starting....
## ------------------------ Done replicate 18
## MCMC starting....
## ------------------------ Done replicate 19
## MCMC starting....
## ------------------------ Done replicate 20

Having done that, we can then take a look at whether our mean correlation across replicates is more or less centered on the generating value of r = 0, and similarly what fraction of the 95% HPDs for the correlation include this generating value. We think it should be around 19/20 or 95%.

R
##  [1] -0.048566633 -0.005317359  0.146492590  0.119758935  0.041327768
##  [6] -0.162323365  0.152224005 -0.143100612  0.043652399  0.107146396
## [11]  0.327710842 -0.094010491  0.195777668 -0.001286759 -0.147711651
## [16]  0.448247724 -0.094734121 -0.042752183 -0.269427398 -0.164817066
mean(R)
## [1] 0.02041453
ci
##              [,1]         [,2]
##  [1,] -0.27823664  0.166015861
##  [2,] -0.31197924  0.324153803
##  [3,] -0.15500085  0.362459410
##  [4,] -0.06140886  0.359871967
##  [5,] -0.15890202  0.249944995
##  [6,] -0.39634111  0.166358895
##  [7,] -0.11452389  0.385336731
##  [8,] -0.32451440  0.049618284
##  [9,] -0.17915762  0.338527915
## [10,] -0.09885426  0.299924390
## [11,]  0.13334608  0.552577114
## [12,] -0.33714016  0.103557367
## [13,]  0.01927001  0.392790224
## [14,] -0.30864220  0.254575236
## [15,] -0.33681573  0.085616931
## [16,]  0.26108001  0.606684937
## [17,] -0.35611455  0.191008778
## [18,] -0.30745918  0.161326103
## [19,] -0.46888144 -0.006236106
## [20,] -0.46673940  0.050538976
mean((ci[,1]<0)*(ci[,2]>0)==1)
## [1] 0.8
## or
plotCI(x=R,y=1:nrep,ui=ci[,2],li=ci[,1],err="x",xlab="HPD(r)",
    ylab="replicate",xlim=c(-1,1),pch=19)
abline(v=r,lty="dotted")

plot of chunk unnamed-chunk-7

Since we also computed the mean covariances & the correlations and covariance from the posterior mean liabilities for each run, let's compare these:

R.mean
##  [1] -0.059078345 -0.010050969  0.185637260  0.151141259  0.052384390
##  [6] -0.217800815  0.230801440 -0.176791006  0.056176701  0.144713388
## [11]  0.422623169 -0.128424671  0.262764952  0.002071499 -0.189752250
## [16]  0.555325822 -0.137313778 -0.057192992 -0.344757074 -0.206130921
mean(R.mean)
## [1] 0.02681735
Cov
##  [1] -0.057641172 -0.008404547  0.166538292  0.208596920  0.039513951
##  [6] -0.226406068  0.240540639 -0.208357788  0.068088427  0.142210715
## [11]  0.478902164 -0.139405998  0.243998991  0.002606433 -0.217982679
## [16]  0.678894959 -0.159555319 -0.071439376 -0.406861207 -0.226330470
mean(Cov)
## [1] 0.02737534
ci.cov
##               [,1]       [,2]
##  [1,] -0.278687429  0.2653090
##  [2,] -0.420817922  0.3900768
##  [3,] -0.130056443  0.4623751
##  [4,] -0.150981029  0.5757184
##  [5,] -0.140800003  0.2304954
##  [6,] -0.601914477  0.1569817
##  [7,] -0.141190450  0.6656397
##  [8,] -0.512008852  0.0871602
##  [9,] -0.334170232  0.4869676
## [10,] -0.133863757  0.3928547
## [11,]  0.095436409  0.7851570
## [12,] -0.508154395  0.1572361
## [13,]  0.004246627  0.4784923
## [14,] -0.526237496  0.4009590
## [15,] -0.489581250  0.1587820
## [16,]  0.420823663  0.9959279
## [17,] -0.638616064  0.3115659
## [18,] -0.450950663  0.3122699
## [19,] -0.813975100 -0.1000418
## [20,] -0.564341396  0.1422885
mean((ci.cov[,1]<0)*(ci.cov[,2]>0)==1)
## [1] 0.8
## or
plotCI(x=Cov,y=1:nrep,ui=ci.cov[,2],li=ci.cov[,1],
    err="x",xlab="HPD(cov)",ylab="replicate",
    xlim=c(-2,2),pch=19)
abline(v=VCV[1,2],lty="dotted")

plot of chunk unnamed-chunk-8

Next, we can try with a correlation between x and y. For this, all I have to do is duplicate the above but change the correlation to some non-zero value. Here I have chosen r = -0.7, but this is totally arbitrary!

nrep<-20
ci<-matrix(NA,nrep,2)
R<-vector()
mcmc<-vector(mode="list",length=20)
r<--0.7
VCV<-matrix(c(1,r*sqrt(2),r*sqrt(2),2),2,2)
for(i in 1:nrep){
    ## simulation
    X<-sim.corrs(trees[[i]],VCV)
    X[,1]<-X[,1]-mean(X[,1])
    x<-sapply(X[,1],threshState,thresholds=setNames(c(0,1,Inf),
        letters[1:3]))
    y<-X[,2]
    ## MCMC
    mcmc[[i]]<-ancThresh(trees[[i]],x,ngen=200000,sequence=letters[1:3],
        control=list(plot=FALSE,print=FALSE))
    r.mcmc<-apply(mcmc[[i]]$liab[21:201,trees[[i]]$tip.label],1,corp,
        y=X[,2],tree=trees[[i]])
    R[i]<-mean(r.mcmc)
    ci[i,]<-HPDinterval(as.mcmc(r.mcmc))
    R.mean[i]<-corp(colMeans(mcmc[[i]]$liab[21:201,
        trees[[i]]$tip.label]),X[,2],trees[[i]])
    cov.mcmc<-apply(mcmc[[i]]$liab[21:201,trees[[i]]$tip.label],1,covp,
        y=X[,2],tree=trees[[i]])
    Cov[i]<-mean(cov.mcmc)
    ci.cov[i,]<-HPDinterval(as.mcmc(cov.mcmc))
    Cov.mean<-covp(colMeans(mcmc[[i]]$liab[21:201,
        trees[[i]]$tip.label]),X[,2],trees[[i]])
    cat(paste("------------------------ Done replicate",i,"\n"))
}
## MCMC starting....
## ------------------------ Done replicate 1
## MCMC starting....
## ------------------------ Done replicate 2
## MCMC starting....
## ------------------------ Done replicate 3
## MCMC starting....
## ------------------------ Done replicate 4
## MCMC starting....
## ------------------------ Done replicate 5
## MCMC starting....
## ------------------------ Done replicate 6
## MCMC starting....
## ------------------------ Done replicate 7
## MCMC starting....
## ------------------------ Done replicate 8
## MCMC starting....
## ------------------------ Done replicate 9
## MCMC starting....
## ------------------------ Done replicate 10
## MCMC starting....
## ------------------------ Done replicate 11
## MCMC starting....
## ------------------------ Done replicate 12
## MCMC starting....
## ------------------------ Done replicate 13
## MCMC starting....
## ------------------------ Done replicate 14
## MCMC starting....
## ------------------------ Done replicate 15
## MCMC starting....
## ------------------------ Done replicate 16
## MCMC starting....
## ------------------------ Done replicate 17
## MCMC starting....
## ------------------------ Done replicate 18
## MCMC starting....
## ------------------------ Done replicate 19
## MCMC starting....
## ------------------------ Done replicate 20
R
##  [1] -0.4442685 -0.3329273 -0.2039185  0.1561997 -0.4330033 -0.5425256
##  [7] -0.4794049 -0.4291188 -0.2694266 -0.3460377 -0.4258567 -0.4981087
## [13] -0.5232131 -0.5044816 -0.2361279 -0.3941872 -0.1844598 -0.4250402
## [19] -0.3504278 -0.5753514
mean(R)
## [1] -0.3720843
ci
##             [,1]         [,2]
##  [1,] -0.6461571 -0.188916059
##  [2,] -0.5258120 -0.095176138
##  [3,] -0.4636199  0.036734499
##  [4,] -0.1420690  0.408757105
##  [5,] -0.6239437 -0.198852437
##  [6,] -0.7197192 -0.295896109
##  [7,] -0.6642045 -0.274124877
##  [8,] -0.6582805 -0.207382961
##  [9,] -0.5618896  0.007048045
## [10,] -0.5650661 -0.125165941
## [11,] -0.5975407 -0.224868117
## [12,] -0.7031133 -0.248217990
## [13,] -0.6752238 -0.357285876
## [14,] -0.6738202 -0.325900549
## [15,] -0.5726131 -0.015487077
## [16,] -0.5936548 -0.150543085
## [17,] -0.4405038  0.037160021
## [18,] -0.6518851 -0.209177296
## [19,] -0.6380389 -0.039961077
## [20,] -0.7346678 -0.406110586
mean((ci[,1]<r)*(ci[,2]>r)==1)
## [1] 0.15
## or
plotCI(x=R,y=1:nrep,ui=ci[,2],li=ci[,1],err="x",xlab="HPD(r)",
    ylab="replicate",xlim=c(-1,1),pch=19)
abline(v=r,lty="dotted")

plot of chunk unnamed-chunk-9

R.mean
##  [1] -0.5845377 -0.4564454 -0.2887876  0.2126492 -0.5455727 -0.7034432
##  [7] -0.6610401 -0.5528324 -0.3826757 -0.4268218 -0.5151527 -0.6099373
## [13] -0.6110249 -0.6154331 -0.3824219 -0.5115940 -0.2494795 -0.5614771
## [19] -0.5375527 -0.7291751
mean(R.mean)
## [1] -0.4856378
Cov
##  [1] -0.5333285 -0.5084913 -0.2576849  0.1644342 -0.6660364 -0.9119713
##  [7] -0.8007782 -0.6881301 -0.4577610 -0.5032624 -0.5714295 -0.7532690
## [13] -0.9704486 -0.7441925 -0.3652020 -0.5726135 -0.2391976 -0.6613127
## [19] -0.4382249 -0.7324093
mean(Cov)
## [1] -0.5605655
ci.cov
##             [,1]        [,2]
##  [1,] -0.8027139 -0.23520232
##  [2,] -0.8881685 -0.13503644
##  [3,] -0.5809815  0.06139351
##  [4,] -0.1515432  0.44017473
##  [5,] -1.0456683 -0.31672684
##  [6,] -1.3254044 -0.46230969
##  [7,] -1.1891047 -0.39266423
##  [8,] -1.0838651 -0.30619133
##  [9,] -0.9962171  0.01140984
## [10,] -0.8123446 -0.10359772
## [11,] -0.8514475 -0.30431646
## [12,] -1.1551143 -0.38880466
## [13,] -1.3674045 -0.55207820
## [14,] -1.0904980 -0.40602170
## [15,] -0.8448122  0.01013859
## [16,] -0.8779924 -0.19469199
## [17,] -0.5146207  0.10636017
## [18,] -1.0223800 -0.25280889
## [19,] -0.7684456  0.06299185
## [20,] -0.9752561 -0.43084719
mean((ci.cov[,1]<VCV[1,2])*(ci.cov[,2]>VCV[1,2])==1)
## [1] 0.45
## or
plotCI(x=Cov,y=1:nrep,ui=ci.cov[,2],li=ci.cov[,1],
    err="x",xlab="HPD(cov)",ylab="replicate",
    xlim=c(-2,2),pch=19)
abline(v=VCV[1,2],lty="dotted")

plot of chunk unnamed-chunk-9

This is a very small experiment, but it at least suggests that this tactic probably does not result in vastly inflated type I error under the null hypothesis. Furthermore, it may not be too bad at estimating the evolutionary correlation - although better with the mean from the posterior (than the mean correlation computed from each sample), and better still with the covariance. (This latter result is not too surprising as we have thrown out most of the information about the evolutionary correlation by thresholding our character trait, so this will probably bias our correlation towards zero.)

One thing perhaps worth noting is that I have done nothing here to assess genuine convergence to the posterior distribution of my Bayesian MCMC runs. A proper analysis of the statistical properties of the method would necessitate this. Furthermore, a failure to converge could result in HPDs that are both too broad or too narrow, as well as systematic bias in the mean correlation from the posterior sample.

Perhaps more on this later….

No comments:

Post a Comment