I just posted some
code to fit
the Pagel (1994) method
for detecting “correlated” evolution of two binary traits. I say
*correlated* because it might be more appropriate to call this a
*dependence* - i.e., the rate of character 2 depends on 1 & *vice
versa*.

This is actually quite simple to do because the test is merely a test of, for
a two character combination written as `[01]`

(meaning state
`0`

for the first character & state `1`

for the second),
the rates of transition from `[00]->[01]`

equal
`[10]->[11]`

; `[00]->[10]`

equal
`[01]->[11]`

; etc. For this we can just paste our two input
characters together, fit both a model in which the dependence exists and
one in which it does not, and then compare them. The dependence model has
twice as many transition rates as the independence model, so this model will
have (for two binary traits) 4 more parameters.

Note that I accomplished this by borrowing `ace`

in the ape
package or `fitDiscrete`

in the geiger package to actually fit
the models. All my
code really
does is pre- & post-process the data & results respectively!

Here's a demo:

```
## first load packages & source code
library(phytools)
library(geiger)
source("fitPagel.R")
.check.pkg<-phytools:::.check.pkg
## now let's simulate some uncorrelated data
tree<-pbtree(n=300,scale=1)
Q<-matrix(c(-1,1,1,-1),2,2)
rownames(Q)<-colnames(Q)<-letters[1:2]
tt1<-sim.history(tree,Q)
```

```
## Done simulation(s).
```

```
tt2<-sim.history(tree,Q)
```

```
## Done simulation(s).
```

```
## these are uncorrelated, see:
par(mfrow=c(1,2))
plotSimmap(tt1,setNames(c("blue","red"),letters[1:2]),ftype="off",lwd=1)
plotSimmap(tt2,setNames(c("blue","red"),letters[1:2]),ftype="off",lwd=1,direction="leftwards")
```

```
x<-tt1$states
y<-tt2$states
fit.ape<-fitPagel(tree,x,y)
fit.ape
```

```
##
## Pagel's binary character correlation test:
##
## Indepedent model rate matrix:
## a|a a|b b|a b|b
## a|a -1.9712 1.0896 0.8816 0.0000
## a|b 0.9774 -1.8589 0.0000 0.8816
## b|a 0.7531 0.0000 -1.8426 1.0896
## b|b 0.0000 0.7531 0.9774 -1.7304
##
## Dependent model rate matrix:
## a|a a|b b|a b|b
## a|a -1.5606 0.8578 0.7028 0.0000
## a|b 0.9261 -1.8634 0.0000 0.9373
## b|a 0.3652 0.0000 -1.5764 1.2112
## b|b 0.0000 1.2754 1.0567 -2.3321
##
## Model fit:
## log-likelihood
## independent -238.7
## dependent -237.5
##
## Hypothesis test result:
## likelihood-ratio: 2.390211
## p-value: 0.6643971
```

```
fit.geiger<-fitPagel(tree,x,y,method="fitDiscrete")
```

```
## Warning: Parameter estimates appear at bounds:
## q14
## q23
## q32
## q41
## Warning: Parameter estimates appear at bounds:
## q14
## q23
## q32
## q41
```

```
fit.geiger
```

```
##
## Pagel's binary character correlation test:
##
## Indepedent model rate matrix:
## a|a a|b b|a b|b
## a|a -1.8847 0.9942 0.8906 0.0000
## a|b 0.9699 -1.8605 0.0000 0.8906
## b|a 0.7112 0.0000 -1.7053 0.9942
## b|b 0.0000 0.7112 0.9699 -1.6811
##
## Dependent model rate matrix:
## a|a a|b b|a b|b
## a|a -1.5000 0.8192 0.6809 0.0000
## a|b 0.8987 -1.8596 0.0000 0.9609
## b|a 0.3704 0.0000 -1.4820 1.1116
## b|b 0.0000 1.1176 1.1075 -2.2251
##
## Model fit:
## log-likelihood
## independent -239.3
## dependent -238.3
##
## Hypothesis test result:
## likelihood-ratio: 2.096727
## p-value: 0.7179738
```

OK, now let's try correlated data. To do that I will have to build a big transition matrix myself for simulation!

```
Q<-matrix(c(0,0.5,0.5,0,2,0,0,2,2,0,0,2,0,0.5,0.5,0),4,4,byrow=TRUE)
rownames(Q)<-colnames(Q)<-c("aa","ab","ba","bb")
diag(Q)<--rowSums(Q)
tt<-sim.history(tree,t(Q))
```

```
## Note - the rate of substitution from i->j should be given by Q[j,i].
## Done simulation(s).
```

```
tt1<-mergeMappedStates(tt,c("aa","ab"),"a")
tt1<-mergeMappedStates(tt1,c("ba","bb"),"b")
tt2<-mergeMappedStates(tt,c("aa","ba"),"a")
tt2<-mergeMappedStates(tt2,c("ab","bb"),"b")
## these data are correlated, see:
par(mfrow=c(1,2))
plotSimmap(tt1,setNames(c("blue","red"),letters[1:2]),ftype="off",lwd=1)
plotSimmap(tt2,setNames(c("blue","red"),letters[1:2]),ftype="off",lwd=1,direction="leftwards")
```

```
x<-getStates(tt1,"tips")
y<-getStates(tt2,"tips")
fit.ape<-fitPagel(tree,x,y)
fit.ape
```

```
##
## Pagel's binary character correlation test:
##
## Indepedent model rate matrix:
## a|a a|b b|a b|b
## a|a -1.4325 0.9831 0.4493 0.0000
## a|b 0.9593 -1.4086 0.0000 0.4493
## b|a 1.0401 0.0000 -2.0233 0.9831
## b|b 0.0000 1.0401 0.9593 -1.9994
##
## Dependent model rate matrix:
## a|a a|b b|a b|b
## a|a -1.043 0.8539 0.1893 0.000
## a|b 5.543 -7.9019 0.0000 2.359
## b|a 1.202 0.0000 -4.4114 3.209
## b|b 0.000 1.0849 0.4054 -1.490
##
## Model fit:
## log-likelihood
## independent -241.1
## dependent -207.8
##
## Hypothesis test result:
## likelihood-ratio: 66.63346
## p-value: 1.164753e-13
```

```
fit.geiger<-fitPagel(tree,x,y,method="fitDiscrete")
```

```
## Warning: Parameter estimates appear at bounds:
## q14
## q23
## q32
## q41
## Warning: Parameter estimates appear at bounds:
## q14
## q23
## q32
## q41
```

```
fit.geiger
```

```
##
## Pagel's binary character correlation test:
##
## Indepedent model rate matrix:
## a|a a|b b|a b|b
## a|a -1.3892 1.006 0.3832 0.0000
## a|b 0.9236 -1.307 0.0000 0.3832
## b|a 1.0639 0.000 -2.0699 1.0061
## b|b 0.0000 1.064 0.9236 -1.9875
##
## Dependent model rate matrix:
## a|a a|b b|a b|b
## a|a -0.8951 0.8951 1.799e-16 0.000
## a|b 5.4502 -7.7907 0.000e+00 2.340
## b|a 1.9288 0.0000 -3.958e+00 2.029
## b|b 0.0000 1.0086 4.105e-01 -1.419
##
## Model fit:
## log-likelihood
## independent -241.9
## dependent -208.8
##
## Hypothesis test result:
## likelihood-ratio: 66.16744
## p-value: 1.460391e-13
```

That's it, I guess. This is brand new so I have no idea how it works but feedback is welcome. I'm also well aware of the recent critique of this method by Maddision & Fitzjohn.

great post! I'm new to the field and this really helps getting some orientation

ReplyDeleteThanks for this.

ReplyDeleteI am still unsure about one thing though: How does one interpret the rates in the Independent & Dependent model rate matrices? What are the units? Do the cells contain the expected number of substitutions per tree? or per tree height? ... Or is it more complicated than that?

Thank you!

Hi Caitlin.

DeleteThe rate Q(a|a -> b|a), for instance, is the instantaneous rate of transition from a->b in character 1, given that character 2 is in state a. In the independent model, the rate of change in 1 cannot be influenced by character 2, consequently Q(a|a->b|a)==Q(a|b->b|b) & so on.

The rates become probabilities after matrix exponentiation of Q*t for elapsed time t; however another intuitive interpretation is that for a lineage starting in state a with Q(a->b)=2, the expected number of changes in a unit of time is 2.

Hope this helps. Liam

Hello again,

ReplyDeleteThank you very much for your advice.

I believe I understand your first point. The example you gave would correspond to the fact that in an independent Q rate matrix, the contents of the cells [1,3] and [2,4] will be equal, correct?

With regard to your second point, though, I confess to remaining a little unsure of myself. I think I agree that the rates are more easily interpreted as numbers of substitutions per unit time. If you have a moment, could I please try to run a small example past you to see if I am understanding you correctly?

Suppose I have an independent rate matrix, iQ, and I want to use the rates estimated therein to simulate data on a given tree. The rate Q(b|a->a|a) (also Q(b|b->a|b)) in my iQ is 0.79, which is the expected number of changes to occur in a unit time.

My tree is not a timed tree, but I can use max(nodeHeights(tree)) to get its height, which is 1.4. The sum of the branch lengths in my tree is 9.5. Am I correct in thinking that there are 9.5/1.4=6.79 “units of time” in my tree? If so, would the expected value of the number of substitutions to occur for this variable, throughout the tree, be 0.79*6.79=5.36?

Finally, if I wanted to simulate data for this variable on this tree using this rate estimate, what would be the recommended procedure for getting an actual value for the number of substitutions to occur in a given simulation (for example, should I round down to 5, or draw a number from a Poisson distribution with parameter 5.36)?

I really appreciate your taking the time to read my comments and offer your input.

Thanks sincerely,

Caitlin.

Hi,

ReplyDeleteEdit--sorry, I just realised it probably makes more sense for 1 to be one unit of time in my tree... If the sum of the branch lengths is 9.5, are there 9.5 "units of time" in my tree?

Best,

Caitlin.

Yes, the amount of time in your tree for things to occur is the sum of the edge lengths. Exactly.

Delete