Sunday, April 12, 2020

More on using threshBayes to measure the correlation between two binary traits in phytools

As I posted yesterday I've been doing a little bit of work updating the phytools function threshBayes.

threshBayes uses Bayesian MCMC to fit a correlational threshold model to two traits: one continuous & one binary trait; or two binary traits. In this way it is much more limited than Joe Felsenstein's THRESHML program, which can be used to analyze an arbitrary number of binary and continuous traits. (THRESHML can be called in R using Rphylip, but it still needs to be installed locally.)

The key update, as I mentioned in my previous post, is tracking the acceptance rate for all the variables in the model, and then updating the proposal distributions progressively to try and hit a user-specified target rate that defaults to 0.234.

Given the very high parameter space of the model, the MCMC is still liable to be relatively inefficient. Nonetheless, this update seems to help a bit with convergence to the posterior distribution.

It is also relatively easy to show that threshBayes actually works fairly well - that is, that when we have a a pretty big tree (100 tips or more) we can usually recover the generating correlation coefficient of a simulation for either two binary traits or one binary & one continuous trait most of the time.

Here's an example. Note that I'm going to simulate a generating correlation coefficient between the liabilities of my two binary traits of r = -0.8. Let's see if we can get that back!

tree<-pbtree(n=100)
tree
## 
## Phylogenetic tree with 100 tips and 99 internal nodes.
## 
## Tip labels:
##  t73, t74, t95, t96, t40, t59, ...
## 
## Rooted; includes branch lengths.
Y<-as.data.frame(sim.corrs(tree,
    V<-matrix(c(
    1.0,-0.8,
    -0.8,1.0),
    2,2)))
V
##      [,1] [,2]
## [1,]  1.0 -0.8
## [2,] -0.8  1.0
Y[[1]]<-as.factor(threshState(Y[[1]],
    setNames(c(0,Inf),c("a","b"))))
Y[[2]]<-as.factor(threshState(Y[[2]],
    setNames(c(0,Inf),c("c","d"))))
Y
##      V1 V2
## t73   a  d
## t74   a  d
## t95   b  d
## t96   b  d
## t40   a  d
## t59   b  c
## t60   b  c
## t91   b  c
## t92   b  c
## t56   a  d
## t57   a  d
## t28   b  c
## t31   b  c
## t58   b  c
## t87   b  c
## t88   b  c
## t43   a  c
## t93   b  c
## t94   b  c
## t89   b  c
## t90   b  c
## t71   b  c
## t72   a  c
## t35   a  d
## t44   a  d
## t50   a  d
## t52   a  d
## t85   a  d
## t86   a  d
## t12   b  c
## t45   b  c
## t46   b  c
## t11   a  c
## t27   b  c
## t54   a  c
## t55   b  c
## t62   b  c
## t63   a  c
## t37   b  c
## t21   b  c
## t79   b  c
## t80   b  c
## t19   b  c
## t69   b  c
## t70   b  c
## t36   b  c
## t9    a  c
## t33   b  c
## t34   b  c
## t2    b  d
## t38   b  c
## t39   b  c
## t67   b  c
## t68   b  c
## t53   b  c
## t99   b  c
## t100  b  c
## t26   a  d
## t81   a  d
## t82   a  d
## t5    b  c
## t8    a  d
## t97   a  d
## t98   a  d
## t29   b  c
## t30   b  c
## t23   b  d
## t24   b  d
## t25   a  d
## t47   a  d
## t75   a  d
## t76   a  d
## t6    b  c
## t7    a  d
## t13   a  c
## t14   b  c
## t64   a  d
## t65   a  c
## t66   a  c
## t48   a  c
## t4    a  c
## t1    a  c
## t3    a  d
## t10   b  c
## t32   b  c
## t41   b  c
## t42   b  c
## t22   b  c
## t77   a  c
## t78   a  c
## t61   a  d
## t51   a  c
## t20   a  d
## t16   b  c
## t83   a  d
## t84   a  d
## t49   a  d
## t17   b  d
## t18   a  d
## t15   b  c

Now, let's run our MCMC - for 10,000,000 generations (so I'm going to set plot=FALSE):

mcmc<-threshBayes(tree,Y,ngen=10000000,plot=FALSE,
    control=list(print.interval=100000))
## Starting MCMC....
## genearation: 100000; mean acceptance rate: 0.26
## genearation: 200000; mean acceptance rate: 0.31
## genearation: 300000; mean acceptance rate: 0.24
## genearation: 400000; mean acceptance rate: 0.29
## genearation: 500000; mean acceptance rate: 0.32
## genearation: 600000; mean acceptance rate: 0.24
## genearation: 700000; mean acceptance rate: 0.17
## genearation: 800000; mean acceptance rate: 0.23
## genearation: 900000; mean acceptance rate: 0.14
## genearation: 1000000; mean acceptance rate: 0.32
## genearation: 1100000; mean acceptance rate: 0.21
## genearation: 1200000; mean acceptance rate: 0.3
## genearation: 1300000; mean acceptance rate: 0.3
## genearation: 1400000; mean acceptance rate: 0.2
## genearation: 1500000; mean acceptance rate: 0.25
## genearation: 1600000; mean acceptance rate: 0.19
## genearation: 1700000; mean acceptance rate: 0.23
## genearation: 1800000; mean acceptance rate: 0.26
## genearation: 1900000; mean acceptance rate: 0.16
## genearation: 2000000; mean acceptance rate: 0.25
## genearation: 2100000; mean acceptance rate: 0.15
## genearation: 2200000; mean acceptance rate: 0.19
## genearation: 2300000; mean acceptance rate: 0.24
## genearation: 2400000; mean acceptance rate: 0.23
## genearation: 2500000; mean acceptance rate: 0.23
## genearation: 2600000; mean acceptance rate: 0.34
## genearation: 2700000; mean acceptance rate: 0.28
## genearation: 2800000; mean acceptance rate: 0.23
## genearation: 2900000; mean acceptance rate: 0.27
## genearation: 3000000; mean acceptance rate: 0.38
## genearation: 3100000; mean acceptance rate: 0.15
## genearation: 3200000; mean acceptance rate: 0.2
## genearation: 3300000; mean acceptance rate: 0.25
## genearation: 3400000; mean acceptance rate: 0.34
## genearation: 3500000; mean acceptance rate: 0.29
## genearation: 3600000; mean acceptance rate: 0.15
## genearation: 3700000; mean acceptance rate: 0.18
## genearation: 3800000; mean acceptance rate: 0.23
## genearation: 3900000; mean acceptance rate: 0.33
## genearation: 4000000; mean acceptance rate: 0.2
## genearation: 4100000; mean acceptance rate: 0.23
## genearation: 4200000; mean acceptance rate: 0.21
## genearation: 4300000; mean acceptance rate: 0.25
## genearation: 4400000; mean acceptance rate: 0.26
## genearation: 4500000; mean acceptance rate: 0.28
## genearation: 4600000; mean acceptance rate: 0.23
## genearation: 4700000; mean acceptance rate: 0.22
## genearation: 4800000; mean acceptance rate: 0.21
## genearation: 4900000; mean acceptance rate: 0.19
## genearation: 5000000; mean acceptance rate: 0.19
## genearation: 5100000; mean acceptance rate: 0.31
## genearation: 5200000; mean acceptance rate: 0.22
## genearation: 5300000; mean acceptance rate: 0.27
## genearation: 5400000; mean acceptance rate: 0.22
## genearation: 5500000; mean acceptance rate: 0.25
## genearation: 5600000; mean acceptance rate: 0.25
## genearation: 5700000; mean acceptance rate: 0.19
## genearation: 5800000; mean acceptance rate: 0.22
## genearation: 5900000; mean acceptance rate: 0.23
## genearation: 6000000; mean acceptance rate: 0.26
## genearation: 6100000; mean acceptance rate: 0.31
## genearation: 6200000; mean acceptance rate: 0.34
## genearation: 6300000; mean acceptance rate: 0.14
## genearation: 6400000; mean acceptance rate: 0.26
## genearation: 6500000; mean acceptance rate: 0.3
## genearation: 6600000; mean acceptance rate: 0.29
## genearation: 6700000; mean acceptance rate: 0.28
## genearation: 6800000; mean acceptance rate: 0.21
## genearation: 6900000; mean acceptance rate: 0.24
## genearation: 7000000; mean acceptance rate: 0.23
## genearation: 7100000; mean acceptance rate: 0.39
## genearation: 7200000; mean acceptance rate: 0.35
## genearation: 7300000; mean acceptance rate: 0.36
## genearation: 7400000; mean acceptance rate: 0.24
## genearation: 7500000; mean acceptance rate: 0.27
## genearation: 7600000; mean acceptance rate: 0.21
## genearation: 7700000; mean acceptance rate: 0.31
## genearation: 7800000; mean acceptance rate: 0.27
## genearation: 7900000; mean acceptance rate: 0.23
## genearation: 8000000; mean acceptance rate: 0.35
## genearation: 8100000; mean acceptance rate: 0.18
## genearation: 8200000; mean acceptance rate: 0.25
## genearation: 8300000; mean acceptance rate: 0.19
## genearation: 8400000; mean acceptance rate: 0.3
## genearation: 8500000; mean acceptance rate: 0.34
## genearation: 8600000; mean acceptance rate: 0.3
## genearation: 8700000; mean acceptance rate: 0.32
## genearation: 8800000; mean acceptance rate: 0.19
## genearation: 8900000; mean acceptance rate: 0.25
## genearation: 9000000; mean acceptance rate: 0.16
## genearation: 9100000; mean acceptance rate: 0.27
## genearation: 9200000; mean acceptance rate: 0.24
## genearation: 9300000; mean acceptance rate: 0.22
## genearation: 9400000; mean acceptance rate: 0.14
## genearation: 9500000; mean acceptance rate: 0.22
## genearation: 9600000; mean acceptance rate: 0.21
## genearation: 9700000; mean acceptance rate: 0.35
## genearation: 9800000; mean acceptance rate: 0.2
## genearation: 9900000; mean acceptance rate: 0.25
## genearation: 10000000; mean acceptance rate: 0.19
## Done MCMC.
mcmc
## 
## Object of class "threshBayes" consisting of a matrix (L) of
## sampled liabilities for the tips of the tree & a second matrix
## (par) with the sample model parameters & correlation.
## 
## Mean correlation (r) from the posterior sample is: -0.63687.
## 
## Ordination of discrete traits:
## 
##  Trait 1: a <-> b
##  Trait 2: c <-> d
plot(mcmc)

plot of chunk unnamed-chunk-2

Lastly, let's visualize our posterior density for r. Remember, the simulated value was r = -0.8.

plot(density(mcmc))

plot of chunk unnamed-chunk-3

Neat.

3 comments:

  1. Hi Liam,
    thank you for the code and very helpful examples. The code itself seems to be not very complicated (for my data:w<-threshBayes(tree,data,types=c("disc","disc"),ngen=100000,control=list(sample=sample))), however, I think that I am making some mistake. All my results are 0, eg.
    Starting MCMC....
    genearation: 1000; mean acceptance rate: 0
    genearation: 2000; mean acceptance rate: 0
    genearation: 3000; mean acceptance rate: 0

    And:
    Mean correlation (r) from the posterior sample is: 0.
    And
    > effectiveSize(rA)
    var1
    0
    > HPDinterval(rA)
    lower upper
    var1 0 0
    attr(,"Probability")
    [1] 0.95

    ReplyDelete
    Replies
    1. Weird. I have seen that error when I fail to specify the 'types' of each character correctly (for instance, if one of the characters is a continuous character but I put types=c("disc","disc")), or, likewise, if I have missing data values in my input data.

      Delete
    2. Thank you for your responce.
      Yes, it's weird. I have check the code with a different tree and different data (worked fine, with more "normal" values), but in the case of the dataset where I have 0, both the continous and discreat date give the same result. I don't have any missing values. The code also gives an error if there are any NAs.

      Delete

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.