Sunday, March 3, 2019

Mk model for intraspecifically polymorphic discrete characters

I just added a new method & function to the phytools package in the form of fitpolyMk.

This function fits an (extended) Mk model to character data with intraspecific polymorphism.

The general idea here is very simple. We just imagine that every character changes between state by going through an intermediate polymorphic condition. So, for instance, in a set of taxa with states A, B, and the polymorphic condition, A + B, we suppose that evolutionary transitions between these three possibilities much occur as follows: AA + BB. For more than two states we might further suppose that (for example) a transition from A + B to B + C would have to go through the intermediate condition of only B (i.e., A + BBB + C), and that transitions to (say) A + B + C could only occur in one step from A + B, A + C, or B + C.

In the function I have implemented the standard suite of models ("ER", "SYM", and "ARD"), which assume that all permitted transitions occur at the same rate ("ER"), that backward & forward transitions occur at the same rate ("SYM" e.g., qAA + B = qA + BA, and so on), or that all allowed transitions can occur at different rates ("ARD"). I also added a new model called "transient" in which I assumed that the acquisition of polymorphism (e.g., AA + B) occurrs at one rate, q1, and its loss (e.g., A + BA) at another (q2). I called it the transient model because I imagined that under some circumstances polymorphism might be evolutionarily more fleeting (or transient) than monomorphism. Finally, I also permit the user to specify that the model be 'ordered' (e.g., AA + BBB + CC) or unordered. Only in the latter should multistate polymorphism (e.g., A + B + C) occur.

The user is required to code the data in the normal way, merely separating each state in a polymorphic taxon by the + character.

Here's a quick example:

library(phytools)
packageVersion("phytools")
## [1] '0.6.67'
tree
## 
## Phylogenetic tree with 200 tips and 199 internal nodes.
## 
## Tip labels:
##  t142, t143, t27, t28, t160, t161, ...
## 
## Rooted; includes branch lengths.
x
## t142 t143  t27  t28 t160 t161 t141  t37  t59 t136 t137  t24  t18  t19 t102 
##    C    C    C    C    C    C    C  B+C  B+C    C    C  B+C    C  B+C    C 
## t103  t30 t113 t114  t38   t7  t51  t52  t70  t71  t35 t149 t150  t83  t34 
##    C    B  B+C  B+C    B    C  B+C  B+C    B    B    B  A+B    A    B  A+B 
##  t21  t81  t82 t138 t171 t172 t179 t180 t156 t157 t151 t144 t127 t128   t3 
##    A    B  A+B    B    B    B    B    B  B+C  B+C  B+C  B+C  B+C  B+C    C 
##   t6  t31  t84  t85  t39  t25  t11  t29  t42 t183 t184 t124   t9  t68  t69 
##    C    C    C    C    C  B+C    C    C    C    C    C    C  B+C    B    B 
## t129 t130 t121  t53 t109 t110  t89  t13  t44 t195 t196  t95  t10  t99 t105 
##    B    B    B    B    B    B    B  A+B    B    B    B    B    B  B+C    B 
## t106  t60 t187 t188 t177 t178  t57  t12 t169 t170  t94 t131 t132  t40  t41 
##  B+C    B  B+C  B+C  B+C  B+C    C    B    B    B    B    B    B    C    C 
##   t8  t49  t50  t14 t162 t163  t93 t117 t118 t133 t134  t76  t17   t2 t165 
##    C    C  B+C    C    B    B    B    B    B    B    B    B    B    B    B 
## t166 t139 t140 t173 t174 t152 t153   t4  t45  t46 t191 t192 t122 t123  t22 
##    B    B    B    B    B    B    B  A+B    B    B    C    C  B+C    B    B 
##  t23   t5  t20  t79  t80  t47  t48  t98 t189 t190 t185 t186  t63 t125 t126 
##    B    B    C    C    C  B+C  B+C  B+C  B+C  B+C    C    C    C    C    C 
## t112  t36  t16  t87  t88   t1 t193 t194  t92  t62 t158 t159 t154 t155  t15 
##    C    C    C    C  B+C    C  B+C  B+C  B+C    B  B+C  B+C  B+C  B+C    C 
##  t54  t64  t65  t77  t78 t199 t200 t164  t32  t33  t43  t90  t91  t74  t75 
##  B+C  B+C  B+C  B+C  B+C    B    B    B  B+C    B  B+C  B+C  B+C  B+C  B+C 
##  t72  t73  t66  t58 t100 t101  t55  t56  t86 t181 t182 t135 t119 t120  t67 
##  B+C  B+C    B  B+C  B+C    C    C  B+C  B+C  B+C  B+C  B+C  B+C  B+C  B+C 
## t175 t176 t107 t108 t197 t198 t167 t168 t111 t145 t146  t26 t104 t147 t148 
##  B+C  B+C    B  A+B  A+B  A+B    B    B  B+C    B  B+C    B    B    B    B 
##  t97 t115 t116  t96  t61 
##    B  B+C  B+C    B    B 
## Levels: A A+B B B+C C
## plot our data:
plotTree(tree,ftype="off",lwd=1,type="fan")
X<-strsplit(setNames(as.character(x),names(x)),"+",fixed=TRUE)
pies<-matrix(0,Ntip(tree),3,dimnames=list(tree$tip.label,c("A","B","C")))
for(i in 1:Ntip(tree)) 
    pies[tree$tip.label[i],X[[tree$tip.label[i]]]]<-rep(1/length(X[[tree$tip.label[i]]]),
        length(X[[tree$tip.label[i]]]))
tiplabels(pie=pies,piecol=c("black","yellow","red"),cex=0.35)
legend(x="topleft",legend=c("A","B","C"),pt.cex=2,pch=21,
    pt.bg=c("black","yellow","red"))

plot of chunk unnamed-chunk-1

Now, let's fit a few different models:

ard.unordered<-fitpolyMk(tree,x,model="ARD")
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##       A B  C A+B A+C B+C A+B+C
## A     0 0  0   1   3   0     0
## B     0 0  0   5   0   7     0
## C     0 0  0   0   9  11     0
## A+B   2 6  0   0   0   0    13
## A+C   4 0 10   0   0   0    15
## B+C   0 8 12   0   0   0    17
## A+B+C 0 0  0  14  16  18     0
ard.unordered
## Object of class "fitpolyMk".
## 
## Evolution was modeled as 'unordered' using the "ARD" model.
## 
## Fitted (or set) value of Q:
##               A         B         C        A+B       A+C       B+C
## A     -71.33397  0.000000  0.000000  71.333965  0.000000  0.000000
## B       0.00000 -1.872671  0.000000   1.006799  0.000000  0.865872
## C       0.00000  0.000000 -1.322289   0.000000  0.000000  1.322289
## A+B    20.05692  7.623430  0.000000 -27.680355  0.000000  0.000000
## A+C     0.00000  0.000000  0.000000   0.000000 -8.847803  0.000000
## B+C     0.00000  1.723192  1.658633   0.000000  0.000000 -3.381825
## A+B+C   0.00000  0.000000  0.000000   0.000000  0.000000  8.281669
##           A+B+C
## A      0.000000
## B      0.000000
## C      0.000000
## A+B    0.000000
## A+C    8.847803
## B+C    0.000000
## A+B+C -8.281669
## 
## Fitted (or set) value of pi:
##         A         B         C       A+B       A+C       B+C     A+B+C 
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 
## 
## Log-likelihood: -154.755089 
## 
## Optimization method used was "nlminb"
ard.ordered<-fitpolyMk(tree,x,model="ARD",ordered=TRUE)
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##     A A+B B B+C C
## A   0   1 0   0 0
## A+B 2   0 3   0 0
## B   0   4 0   5 0
## B+C 0   0 6   0 7
## C   0   0 0   8 0
ard.ordered
## Object of class "fitpolyMk".
## 
## Evolution was modeled as 'ordered' (i.e., transitions are assumed
## to occur A <-> A+B <-> B and so on) using the "ARD" model.
## 
## Fitted (or set) value of Q:
##             A        A+B         B       B+C         C
## A   -71.85153  71.851529  0.000000  0.000000  0.000000
## A+B  19.64117 -27.727843  8.086678  0.000000  0.000000
## B     0.00000   0.771046 -2.385605  1.614559  0.000000
## B+C   0.00000   0.000000  1.280922 -3.389113  2.108191
## C     0.00000   0.000000  0.000000  0.723070 -0.723070
## 
## Fitted (or set) value of pi:
##   A A+B   B B+C   C 
## 0.2 0.2 0.2 0.2 0.2 
## 
## Log-likelihood: -154.966753 
## 
## Optimization method used was "nlminb"
sym.unordered<-fitpolyMk(tree,x,model="SYM")
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##       A B C A+B A+C B+C A+B+C
## A     0 0 0   1   2   0     0
## B     0 0 0   3   0   4     0
## C     0 0 0   0   5   6     0
## A+B   1 3 0   0   0   0     7
## A+C   2 0 5   0   0   0     8
## B+C   0 4 6   0   0   0     9
## A+B+C 0 0 0   7   8   9     0
## Warning in log(comp[1:M + N]): NaNs produced
sym.unordered
## Object of class "fitpolyMk".
## 
## Evolution was modeled as 'unordered' using the "SYM" model.
## 
## Fitted (or set) value of Q:
##               A         B         C       A+B       A+C       B+C
## A     -5.969063  0.000000  0.000000  5.969063  0.000000  0.000000
## B      0.000000 -2.081912  0.000000  0.713146  0.000000  1.368766
## C      0.000000  0.000000 -1.628471  0.000000  0.000000  1.628471
## A+B    5.969063  0.713146  0.000000 -6.682209  0.000000  0.000000
## A+C    0.000000  0.000000  0.000000  0.000000 -0.427551  0.000000
## B+C    0.000000  1.368766  1.628471  0.000000  0.000000 -2.997237
## A+B+C  0.000000  0.000000  0.000000  0.000000  0.427551  0.000000
##           A+B+C
## A      0.000000
## B      0.000000
## C      0.000000
## A+B    0.000000
## A+C    0.427551
## B+C    0.000000
## A+B+C -0.427551
## 
## Fitted (or set) value of pi:
##         A         B         C       A+B       A+C       B+C     A+B+C 
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 
## 
## Log-likelihood: -160.029691 
## 
## Optimization method used was "nlminb"
sym.ordered<-fitpolyMk(tree,x,model="SYM",ordered=TRUE)
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##     A A+B B B+C C
## A   0   1 0   0 0
## A+B 1   0 2   0 0
## B   0   2 0   3 0
## B+C 0   0 3   0 4
## C   0   0 0   4 0
sym.ordered
## Object of class "fitpolyMk".
## 
## Evolution was modeled as 'ordered' (i.e., transitions are assumed
## to occur A <-> A+B <-> B and so on) using the "SYM" model.
## 
## Fitted (or set) value of Q:
##             A       A+B         B       B+C         C
## A   -5.969319  5.969319  0.000000  0.000000  0.000000
## A+B  5.969319 -6.682465  0.713146  0.000000  0.000000
## B    0.000000  0.713146 -2.081906  1.368761  0.000000
## B+C  0.000000  0.000000  1.368761 -2.997227  1.628467
## C    0.000000  0.000000  0.000000  1.628467 -1.628467
## 
## Fitted (or set) value of pi:
##   A A+B   B B+C   C 
## 0.2 0.2 0.2 0.2 0.2 
## 
## Log-likelihood: -159.693219 
## 
## Optimization method used was "nlminb"
transient.unordered<-fitpolyMk(tree,x,model="transient")
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##       A B C A+B A+C B+C A+B+C
## A     0 0 0   2   2   0     0
## B     0 0 0   2   0   2     0
## C     0 0 0   0   2   2     0
## A+B   1 1 0   0   0   0     2
## A+C   1 0 1   0   0   0     2
## B+C   0 1 1   0   0   0     2
## A+B+C 0 0 0   1   1   1     0
transient.unordered
## Object of class "fitpolyMk".
## 
## Evolution was modeled as 'unordered' using the "transient" model.
## 
## Fitted (or set) value of Q:
##               A         B         C       A+B       A+C       B+C
## A     -0.696158  0.000000  0.000000  0.348079  0.348079  0.000000
## B      0.000000 -0.696158  0.000000  0.348079  0.000000  0.348079
## C      0.000000  0.000000 -0.696158  0.000000  0.348079  0.348079
## A+B    1.572038  1.572038  0.000000 -3.492154  0.000000  0.000000
## A+C    1.572038  0.000000  1.572038  0.000000 -3.492154  0.000000
## B+C    0.000000  1.572038  1.572038  0.000000  0.000000 -3.492154
## A+B+C  0.000000  0.000000  0.000000  1.572038  1.572038  1.572038
##           A+B+C
## A      0.000000
## B      0.000000
## C      0.000000
## A+B    0.348079
## A+C    0.348079
## B+C    0.348079
## A+B+C -4.716113
## 
## Fitted (or set) value of pi:
##         A         B         C       A+B       A+C       B+C     A+B+C 
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 
## 
## Log-likelihood: -169.815739 
## 
## Optimization method used was "nlminb"
transient.ordered<-fitpolyMk(tree,x,model="transient",ordered=TRUE)
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##     A A+B B B+C C
## A   0   2 0   0 0
## A+B 1   0 1   0 0
## B   0   2 0   2 0
## B+C 0   0 1   0 1
## C   0   0 0   2 0
transient.ordered
## Object of class "fitpolyMk".
## 
## Evolution was modeled as 'ordered' (i.e., transitions are assumed
## to occur A <-> A+B <-> B and so on) using the "transient" model.
## 
## Fitted (or set) value of Q:
##             A       A+B         B       B+C         C
## A   -0.865778  0.865778  0.000000  0.000000  0.000000
## A+B  1.663303 -3.326606  1.663303  0.000000  0.000000
## B    0.000000  0.865778 -1.731555  0.865778  0.000000
## B+C  0.000000  0.000000  1.663303 -3.326606  1.663303
## C    0.000000  0.000000  0.000000  0.865778 -0.865778
## 
## Fitted (or set) value of pi:
##   A A+B   B B+C   C 
## 0.2 0.2 0.2 0.2 0.2 
## 
## Log-likelihood: -158.782444 
## 
## Optimization method used was "nlminb"
er.unordered<-fitpolyMk(tree,x,model="ER")
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##       A B C A+B A+C B+C A+B+C
## A     0 0 0   1   1   0     0
## B     0 0 0   1   0   1     0
## C     0 0 0   0   1   1     0
## A+B   1 1 0   0   0   0     1
## A+C   1 0 1   0   0   0     1
## B+C   0 1 1   0   0   0     1
## A+B+C 0 0 0   1   1   1     0
er.unordered
## Object of class "fitpolyMk".
## 
## Evolution was modeled as 'unordered' using the "ER" model.
## 
## Fitted (or set) value of Q:
##               A         B         C       A+B       A+C       B+C
## A     -1.708008  0.000000  0.000000  0.854004  0.854004  0.000000
## B      0.000000 -1.708008  0.000000  0.854004  0.000000  0.854004
## C      0.000000  0.000000 -1.708008  0.000000  0.854004  0.854004
## A+B    0.854004  0.854004  0.000000 -2.562012  0.000000  0.000000
## A+C    0.854004  0.000000  0.854004  0.000000 -2.562012  0.000000
## B+C    0.000000  0.854004  0.854004  0.000000  0.000000 -2.562012
## A+B+C  0.000000  0.000000  0.000000  0.854004  0.854004  0.854004
##           A+B+C
## A      0.000000
## B      0.000000
## C      0.000000
## A+B    0.854004
## A+C    0.854004
## B+C    0.854004
## A+B+C -2.562012
## 
## Fitted (or set) value of pi:
##         A         B         C       A+B       A+C       B+C     A+B+C 
## 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 
## 
## Log-likelihood: -183.822936 
## 
## Optimization method used was "nlminb"
er.ordered<-fitpolyMk(tree,x,model="ER",ordered=TRUE)
## 
## This is the design matrix of the fitted model. Does it make sense?
## 
##     A A+B B B+C C
## A   0   1 0   0 0
## A+B 1   0 1   0 0
## B   0   1 0   1 0
## B+C 0   0 1   0 1
## C   0   0 0   1 0
er.ordered
## Object of class "fitpolyMk".
## 
## Evolution was modeled as 'ordered' (i.e., transitions are assumed
## to occur A <-> A+B <-> B and so on) using the "ER" model.
## 
## Fitted (or set) value of Q:
##             A       A+B         B       B+C         C
## A   -1.317534  1.317534  0.000000  0.000000  0.000000
## A+B  1.317534 -2.635067  1.317534  0.000000  0.000000
## B    0.000000  1.317534 -2.635067  1.317534  0.000000
## B+C  0.000000  0.000000  1.317534 -2.635067  1.317534
## C    0.000000  0.000000  0.000000  1.317534 -1.317534
## 
## Fitted (or set) value of pi:
##   A A+B   B B+C   C 
## 0.2 0.2 0.2 0.2 0.2 
## 
## Log-likelihood: -161.927196 
## 
## Optimization method used was "nlminb"

We can compare the AIC values of our different models in a table:

data.frame(transition_model=c("ARD","ARD","SYM","SYM","transient","transient","ER","ER"),
    ordered=c("no","yes","no","yes","no","yes","no","yes"),
    logLik=c(logLik(ard.unordered),logLik(ard.ordered),logLik(sym.unordered),
    logLik(sym.ordered),logLik(transient.unordered),logLik(transient.ordered),
    logLik(er.unordered),logLik(er.ordered)),
    k=c(attr(AIC(ard.unordered),"df"),attr(AIC(ard.ordered),"df"),
    attr(AIC(sym.unordered),"df"),attr(AIC(sym.ordered),"df"),
    attr(AIC(transient.unordered),"df"),attr(AIC(transient.ordered),"df"),
    attr(AIC(er.unordered),"df"),attr(AIC(er.ordered),"df")),
    AIC=c(AIC(ard.unordered),AIC(ard.ordered),AIC(sym.unordered),
    AIC(sym.ordered),AIC(transient.unordered),AIC(transient.ordered),
    AIC(er.unordered),AIC(er.ordered)))
##   transition_model ordered    logLik  k      AIC
## 1              ARD      no -154.7551 18 345.5102
## 2              ARD     yes -154.9668  8 325.9335
## 3              SYM      no -160.0297  9 338.0594
## 4              SYM     yes -159.6932  4 327.3864
## 5        transient      no -169.8157  2 343.6315
## 6        transient     yes -158.7824  2 321.5649
## 7               ER      no -183.8229  1 369.6459
## 8               ER     yes -161.9272  1 325.8544

This should tell us that the transient-ordered model is best-justified by the data, which is good - because this is the model I used to produce them:

tree<-pbtree(n=200,scale=1)
Q<-matrix(c(-1,1,0,0,0,
    2,-4,2,0,0,
    0,1,-2,1,0,
    0,0,2,-4,2,
    0,0,0,1,-1),5,5,byrow=TRUE)
rownames(Q)<-colnames(Q)<-c("A","A+B","B","B+C","C")
x<-sim.Mk(tree,Q)

That's it.

1 comment:

  1. Hi Liam
    How can I use this function within make.simmap and can I use it with a probability matrix to include tips that are NA?
    I have tried it with replacing fitMK in make.simmmap.R with fitpolyMK however, I either get "Error in getPars(bt, xx, model, Q = NULL, tree, tol, m, pi = pi, args = list(...)) :
    could not find function "getPars""
    if I only change the make.saimmap fucntion or, if I change all functions with make.simmap.R from thegithub repository to have at every occurence of fitMk fitpolyMK, I get the error " Error in strsplit(x, "+", fixed = TRUE) : non-character argument
    5.
    strsplit(x, "+", fixed = TRUE)
    4.
    (function (tree, x, model = "SYM", ordered = FALSE, ...)
    {
    if (hasArg(quiet))
    quiet <- list(...)$quiet ...
    3.
    do.call(fitpolyMk, args)
    2.
    getPars(bt, xx, model, Q = NULL, tree, tol, m, pi = pi, args = list(...))
    1.
    make.simmap(pruned.tree, mymatrix_2, model = "ARD", nsim = 10,
    tips = T) "
    Thanks for any advice on how to implement it into make.simmap, this is exactly what I was looking for as a function

    ReplyDelete