Monday, September 11, 2017

Extension of ratebytree for discretely valued characters

As I mentioned the other day, during my sabbatical at the Universidad de los Andes here in Bogotá I submitted a paper with some colleagues describing the function ratebytree for comparing the evolutionary rates of a continuous character between trees.

The manuscript was favorably reviewed, but in their typically pesky* (*said with affection) manner, reviewers asked us to add different models of evolution - the feature that I added last week. They also asked us to extend the approach to discrete character evolution. This was something we pointed out could be done in the manuscript, but had not explicitly addressed. It is theoretically straightforward - although not completely trivial - particularly if you want your new function updates to be accompanied by legible print methods & so on.

I have, however, now added this feature to the function ratebytree. The function can take an argument, type (which should assume the value "continuous" or "discrete"); however, if not provided, the function will attempt to ascertain the character type.

Here is a quick demo using some simple simulated data:

library(phytools)
packageVersion("phytools")
## [1] '0.6.25'

First, our trees & data:

t1
## 
## Phylogenetic tree with 26 tips and 25 internal nodes.
## 
## Tip labels:
##  A, B, C, D, E, F, ...
## 
## Rooted; includes branch lengths.
x1
## A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 
## c c c c a c b c b c a c b a b c b b a c a c c c c c 
## Levels: a b c
t2
## 
## Phylogenetic tree with 40 tips and 39 internal nodes.
## 
## Tip labels:
##  t7, t34, t35, t3, t29, t30, ...
## 
## Rooted; includes branch lengths.
x2
##  t7 t34 t35  t3 t29 t30 t31 t14 t15  t2 t39 t40  t9  t8 t21 t22 t32 t33 
##   b   a   a   c   b   b   b   b   c   b   a   a   c   c   a   a   c   c 
## t23 t24 t19 t20 t11  t5  t6 t25 t26  t4 t36 t37 t38 t12 t13 t27 t28 t16 
##   c   c   b   b   c   c   a   c   c   a   b   b   b   c   c   c   b   c 
##  t1 t10 t17 t18 
##   a   a   a   a 
## Levels: a b c
## or
dotTree(t3,x3,ftype="off")

plot of chunk unnamed-chunk-2

x<-list(x1,x2,x3)
x
## [[1]]
## A B C D E F G H I J K L M N O P Q R S T U V W X Y Z 
## c c c c a c b c b c a c b a b c b b a c a c c c c c 
## Levels: a b c
## 
## [[2]]
##  t7 t34 t35  t3 t29 t30 t31 t14 t15  t2 t39 t40  t9  t8 t21 t22 t32 t33 
##   b   a   a   c   b   b   b   b   c   b   a   a   c   c   a   a   c   c 
## t23 t24 t19 t20 t11  t5  t6 t25 t26  t4 t36 t37 t38 t12 t13 t27 t28 t16 
##   c   c   b   b   c   c   a   c   c   a   b   b   b   c   c   c   b   c 
##  t1 t10 t17 t18 
##   a   a   a   a 
## Levels: a b c
## 
## [[3]]
## t19 t20  t3 t11 t12 t36 t37 t21 t24 t25 t10 t13 t14  t2  t7  t8  t9 t16 
##   a   a   a   b   a   c   b   a   a   a   b   a   a   a   c   b   a   c 
## t28 t29 t32 t33 t17 t34 t35 t18  t1 t39 t40 t22 t23 t41 t42  t4  t5 t26 
##   b   b   b   b   b   b   b   b   b   c   c   c   c   b   b   b   a   b 
## t27  t6 t49 t50 t15 t43 t44 t45 t46 t30 t31 t47 t48 t38 
##   b   b   c   c   b   c   c   b   b   c   c   b   b   b 
## Levels: a b c
trees<-c(t1,t2,t3)
trees
## 3 phylogenetic trees

OK, now let's try it! First, I will assume a symmetric model of transitions between states:

fitSYM<-ratebytree(trees,x,type="discrete",model="SYM")
fitSYM
## ML common-rate model:
##          a<->b   a<->c   b<->c   k   logL
## value    1.3136  0.0815  3.0745  3   -100.7769
## 
## Model fitting method was "optim".
## 
## ML multi-rate model:
##          a<->b   a<->c   b<->c   k   logL
## tree1    3.5598  0   5.2209  9   -99.2075
## tree2    0.2588  0.5751  1.5436
## tree3    1.3966  0   3.2116
## 
## Model fitting method was "nlminb".
## 
## Likelihood ratio: 3.1388 
## P-value (based on X^2): 0.7912

This tells us that we have little evidence to reject a constant (set of) rate(s) across our trees. We can nonetheless try simpler or more complex models and see if the outcome is similar or different. Remember, I said the function would attempt to ascertain if the character is discretely or continuously-valued, so let's try that:

fitER<-ratebytree(trees,x)
fitER
## ML common-rate model:
##          q       k   logL
## value    1.1995  1   -104.2243
## 
## Model fitting method was "optimize".
## 
## ML multi-rate model:
##          q       k   logL
## tree1    2.1052  3   -103.4967
## tree2    0.8053
## tree3    1.4482
## 
## Model fitting method was "nlminb".
## 
## Likelihood ratio: 1.4552 
## P-value (based on X^2): 0.4831
fitARD<-ratebytree(trees,x,model="ARD")
fitARD
## ML common-rate model:
##          b->a    c->a    a->b    c->b    a->c    b->c    k   logL
## value    -0.0483 1.1847  2.3293  1.5367  -0.7321 3.8122  6   -99.4653
## 
## Model fitting method was "optim".
## 
## ML multi-rate model:
##          b->a    c->a    a->b    c->b    a->c    b->c    k   logL
## tree1    4.0703  0   4.1987  7.5158  0   16.8153 18  -94.8226
## tree2    0.1143  0.5863  0   1.2591  0.1455  2.3226
## tree3    0.6492  0.2584  1.9163  4.7043  0   2.7484
## 
## Model fitting method was "nlminb".
## 
## Likelihood ratio: 9.2853 
## P-value (based on X^2): 0.6784

By chance, these data were simulated as follows - and thus without a difference in rate or process between trees:

t1<-pbtree(n=26,tip.label=LETTERS,scale=1)
t2<-pbtree(n=40,scale=2)
t3<-pbtree(n=50,scale=1)
Q<-matrix(c(-2,1,1,1,-2,1,1,1,-2),3,3)
rownames(Q)<-colnames(Q)<-letters[1:3]
x1<-as.factor(sim.history(t1,Q)$states)
x2<-as.factor(sim.history(t2,Q)$states)
x3<-as.factor(sim.history(t3,Q)$states)

Why don't we simulate under conditions in which the process differs between trees and see if we obtain a different result:

t4<-pbtree(n=60,scale=1)
t5<-pbtree(n=70,scale=2)
t6<-pbtree(n=80,scale=1)
Q1<-matrix(c(-1,1,1,-1),2,2)
Q2<-matrix(c(-2,2,2,-2),2,2)
x4<-as.factor(sim.history(t4,Q1)$states)
x5<-as.factor(sim.history(t5,Q1)$states)
x6<-as.factor(sim.history(t6,Q2)$states)

Now we're ready to fit our models:

fitER<-ratebytree(c(t4,t5,t6),list(x4,x5,x6))
fitER
## ML common-rate model:
##          q       k   logL
## value    1.2509  1   -116.4188
## 
## Model fitting method was "optimize".
## 
## ML multi-rate model:
##          q       k   logL
## tree1    0.8768  3   -112.826
## tree2    0.7286
## tree3    2.7957
## 
## Model fitting method was "nlminb".
## 
## Likelihood ratio: 7.1856 
## P-value (based on X^2): 0.0275

This suggests that a model in which the different trees are permitted to have different rates of evolution fits better than one in which they are all constrained to have equal rates - just as we simulated.

Finally, to get this functionality you should update phytools from GitHub as follows:

library(devtools)
install_github("liamrevell/phytools")

More later.

No comments:

Post a Comment