Phylogenetic Tools for Comparative Biology: Function to simulate variable-rate Mk model mapped onto the tree

A little while ago I posted a function to fit multiple Mk models to different painted 'regimes' on the tree.

This method differs from others such as the 'covarion' model of Penny et al. (2001) or the related hidden rates model of Beaulieu et al. (2013) in that it treats the regimes as 'known' - rather than integrating over uncertainty in their painting. Thus, the approach is more appropriate for comparing among a limited number of a priori hypotheses - such as a difference in rate between clades or across time periods (such as geological eras), rather than for testing for a relationship between two or more discrete traits or for trying to identify changes in the rate of character evolution for a single discrete trait, absent an a priori hypothesis regarding how the rate may have changed.

I'm trying to write a short note describing this method, but first I need a simulator for the model. In a prior post, I simulated under a variable-rate model merely by stretching the branch lengths of the tree. Unfortunately, this only works if Q₂ is a scalar multiple of Q₁. For more complex scenarios in which the various Q_i we'd like to simulate are not proportional we'll need a tool that is a tiny bit more sophisticated.

Using the phytools "simmap" object class to encode our paintings, the way I decided to do this was first by converting our "simmap" object to an ordinary "phylo" object with singleton nodes instead of mappings along edges. This way, we can only have one regime per edge. Then I proceeded across all the edges of the tree with singletons and for each edge computed P=expm(Q_i×t) in which Q_i indicates the correct transition matrix for that edge, t is the length of the edge, expm is the matrix exponential, and P is the matrix of transition probabilities. Next, I proceeded to carry out a single preorder tree traversal in which every node is visited, a character state is assigned using the state at the parent node and the computed matrix of transition probabilities, P, and then the algorithm proceeds to the daughters of the current node until all nodes have been visited.

Here is what that code looks like:

sim.multiMk<-function(tree,Q,anc=NULL,...){
    ss<-rownames(Q[[1]])
    tt<-map.to.singleton(reorder(tree))
    P<-vector(mode="list",length=nrow(tt$edge))
    for(i in 1:nrow(tt$edge))
        P[[i]]<-expm(Q[[names(tt$edge.length)[i]]]*tt$edge.length[i])
    if(is.null(anc)) anc<-sample(ss,1)
    STATES<-matrix(NA,nrow(tt$edge),2)
    root<-Ntip(tt)+1
    STATES[which(tt$edge[,1]==root),1]<-anc
    for(i in 1:nrow(tt$edge)){
        new<-ss[which(rmultinom(1,1,P[[i]][STATES[i,1],])[,1]==1)]
        STATES[i,2]<-new
        ii<-which(tt$edge[,1]==tt$edge[i,2])
        if(length(ii)>0) STATES[ii,1]<-new
    }
    x<-as.factor(
        setNames(sapply(1:Ntip(tt),function(n,S,E) S[which(E==n)],
        S=STATES[,2],E=tt$edge[,2]),tt$tip.label))
    x
}

For completeness, here's an equivalent simulator for a single Q matrix:

sim.Mk<-function(tree,Q,anc=NULL,...){
    ss<-rownames(Q)
    tt<-reorder(tree)
    P<-vector(mode="list",length=nrow(tt$edge))
    for(i in 1:nrow(tt$edge))
        P[[i]]<-expm(Q*tt$edge.length[i])
    if(is.null(anc)) anc<-sample(ss,1)
    STATES<-matrix(NA,nrow(tt$edge),2)
    root<-Ntip(tt)+1
    STATES[which(tt$edge[,1]==root),1]<-anc
    for(i in 1:nrow(tt$edge)){
        new<-ss[which(rmultinom(1,1,P[[i]][STATES[i,1],])[,1]==1)]
        STATES[i,2]<-new
        ii<-which(tt$edge[,1]==tt$edge[i,2])
        if(length(ii)>0) STATES[ii,1]<-new
    }
    x<-as.factor(
        setNames(sapply(1:Ntip(tt),function(n,S,E) S[which(E==n)],
        S=STATES[,2],E=tt$edge[,2]),tt$tip.label))
    x
}

For fun, let's imagine the following regimes:

library(phytools)
colors<-setNames(c("blue","red"),0:1)
plot(tree,lwd=1,colors=colors,ftype="off")

plot of chunk unnamed-chunk-3

& then let's simulate two characters, x & y on this tree in which x evolved via a variable-rate process, whereas y evolved under a single set of transition rates:

Qx<-setNames(
    list(matrix(c(-0.5,0.5,0.5,-0.5),2,2,dimnames=list(letters[1:2],
    letters[1:2])),
    matrix(c(-5,5,5,-5),2,2,dimnames=list(letters[1:2],
    letters[1:2]))),0:1)
Qx

## $`0`
##      a    b
## a -0.5  0.5
## b  0.5 -0.5
## 
## $`1`
##    a  b
## a -5  5
## b  5 -5

Qy<-matrix(c(-1,1,1,-1),2,2,dimnames=list(letters[1:2],letters[1:2]))
Qy

##    a  b
## a -1  1
## b  1 -1

x<-sim.multiMk(tree,Qx)
plot(tree,colors,lwd=1,ftype="off",type="fan",mar=c(1.1,1.1,4.1,1.1))
par(fg="transparent")
tiplabels(pie=to.matrix(x,letters[1:2]),
    piecol=RColorBrewer::brewer.pal(n=3,"PRGn")[c(1,3)],
    cex=0.3)
par(fg="black")
title(main="Discrete trait simulated via a variable-rate process")

plot of chunk unnamed-chunk-4

y<-sim.Mk(tree,Qy)
plot(tree,colors,lwd=1,ftype="off",type="fan",mar=c(1.1,1.1,4.1,1.1))
par(fg="transparent")
tiplabels(pie=to.matrix(y,letters[1:2]),
    piecol=RColorBrewer::brewer.pal(n=3,"PRGn")[c(1,3)],
    cex=0.3)
par(fg="black")
title(main="Discrete trait simulated via a constant-rate process")

plot of chunk unnamed-chunk-4

Finally, we can fit both the constant- and variable-rate models to each data vector:

fit1.x<-fitMk(tree,x)
fit1.x

## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##           a         b
## a -1.140655  1.140655
## b  1.140655 -1.140655
## 
## Fitted (or set) value of pi:
##   a   b 
## 0.5 0.5 
## 
## Log-likelihood: -102.305606 
## 
## Optimization method used was "nlminb"

fitmulti.x<-fitmultiMk(tree,x)
fitmulti.x

## Object of class "fitmultiMk".
## 
## Fitted value of Q[0]:
##           a         b
## a -0.641288  0.641288
## b  0.641288 -0.641288
## 
## Fitted value of Q[1]:
##           a         b
## a -5.843021  5.843021
## b  5.843021 -5.843021
## 
## Fitted (or set) value of pi:
##   a   b 
## 0.5 0.5 
## 
## Log-likelihood: -95.203038 
## 
## Optimization method used was "nlminb"

fit1.y<-fitMk(tree,y)
fit1.y

## Object of class "fitMk".
## 
## Fitted (or set) value of Q:
##           a         b
## a -1.079919  1.079919
## b  1.079919 -1.079919
## 
## Fitted (or set) value of pi:
##   a   b 
## 0.5 0.5 
## 
## Log-likelihood: -101.832729 
## 
## Optimization method used was "nlminb"

fitmulti.y<-fitmultiMk(tree,y)
fitmulti.y

## Object of class "fitmultiMk".
## 
## Fitted value of Q[0]:
##           a         b
## a -1.141888  1.141888
## b  1.141888 -1.141888
## 
## Fitted value of Q[1]:
##           a         b
## a -0.967775  0.967775
## b  0.967775 -0.967775
## 
## Fitted (or set) value of pi:
##   a   b 
## 0.5 0.5 
## 
## Log-likelihood: -101.770944 
## 
## Optimization method used was "nlminb"

and even compare between models:

library(lmtest)
LR.x<-suppressWarnings(lrtest(fit1.x,fitmulti.x))
LR.x

## Likelihood ratio test
## 
## Model 1: fit1.x
## Model 2: fitmulti.x
##   #Df   LogLik Df  Chisq Pr(>Chisq)    
## 1   1 -102.306                         
## 2   2  -95.203  1 14.205  0.0001639 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

LR.y<-suppressWarnings(lrtest(fit1.y,fitmulti.y))
LR.y

## Likelihood ratio test
## 
## Model 1: fit1.y
## Model 2: fitmulti.y
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   1 -101.83                     
## 2   2 -101.77  1 0.1236     0.7252

Neat.

Phylogenetic Tools for Comparative Biology

Pages

Saturday, February 3, 2018

Function to simulate variable-rate Mk model mapped onto the tree

No comments:

Post a Comment