Thursday, September 12, 2013

Simple DNA sequence simulator using sim.history internally

I was recently playing with simSeq in the phangorn package - but I couldn't get it to do exactly what I wanted (probably for lack of sufficient patience). Then I realized that I could (nearly) just as easily simulate DNA sequences using phytools with the function sim.history. Here's a quick & incredibly simple function that I wrote to do this that wraps around sim.history:

genSeq<-function(tree,l=1000,Q=NULL,rate=1, format="DNAbin",...){
  if(format=="DNAbin") return(as.DNAbin(X))
  else if(format=="phyDat") return(as.phyDat(X))
  else if(format=="matrix") return(X)

Yes, it's really that simple. Now, admittedly this function cannot simulate rate heterogeneity among sites, unequal base frequencies, or invariant sites, but these would be relatively straightforward to add.

Here's, let's try it out:

> ## first let's generate a random tree
> ## with a basal root taxon "Z"
> tree<-pbtree(n=25,scale=0.9)
> tree$tip.label<-LETTERS[25:1]
> tree$root.edge<-0
> root<-list(edge=matrix(c(3,1,3,2),2,2,byrow=TRUE),
> class(root)<-"phylo"
> tree<-paste.tree(root,tree)
> ## now let's simulate under Juke-Cantor
> ## (the default)
> X<-genSeq(tree,l=2000,rate=0.1,format="phyDat")
> X
26 sequences with 2000 character and 1711 different site patterns.
The states are a c g t
> ## now let's use our data for inference
> library(phangorn)
> obj<-pml(rtree(n=26,tip.label=LETTERS),X)
> fit<-optim.pml(obj,optNni=TRUE)
optimize edge weights: -56156.91 --> -40367.43
optimize edge weights: -40367.43 --> -40367.43
optimize topology: -40367.43 --> -38967.95
optimize topology: -25858.87 --> -25858.87
Warning message:
I unrooted the tree (rooted trees are not yet supported)
> ## measure RF distance to original
> RF.dist(unroot(tree),unroot(fit$tree))
[1] 0
> ## plot original and estimated tree
> par(mfcol=c(2,1))
> plotTree(tree,mar=c(0.1,1.1,3.1,0.1))
> title("a) Generating tree",adj=0,cex.main=1.2)
> plotTree(midpoint(fit$tree),mar=c(0.1,1.1,3.1,0.1))
> title("b) Estimated tree using ML",adj=0,cex.main=1.2)

That works OK. Adding other attributes typical of molecular sequence models is really easy, so I'll probably do that. Pretty cool.


  1. Hi Liam,

    what was your problem with simSeq? So far simSeq is about 100 times faster and there is still some room for possible improvements.

    > system.time(X<-genSeq(tree,l=2000,rate=0.1,format="phyDat"))
    user system elapsed
    18.628 0.000 18.621
    > X
    26 sequences with 2000 character and 1688 different site patterns.
    The states are a c g t
    > system.time(X2<-simSeq(tree,l=2000,rate=0.1))
    user system elapsed
    0.112 0.044 0.103
    > X2
    26 sequences with 2000 character and 796 different site patterns.
    The states are a c g t

    Your use of the rate parameter is a problematic here. I scaled the Q matrix using the formula (13.14) on page 205 from Felsensteins "Inferring phylogenies". Q contains \alpha to \eta in Felsensteins notation. We can re-estimate the rate parameter with optim.pml and the original tree for the simulated data:

    fit2<-optim.pml(obj,optRate=TRUE, optEdge=FALSE)

    fit3<-optim.pml(obj3,optRate=TRUE, optEdge=FALSE)

    > fit2$rate
    [1] 0.2468205
    # too high
    > fit3$rate
    [1] 0.1011709
    # about right and seems to converge nicely
    # to 0.1 for larger sequences ;)

    simSeq is very rough so far, mainly as a working horse for some bigger simulations. The rate parameter can be used for a (discrete) gamma model. I always wanted to improve it, e.g. allowing rate variation parameters directly or codon models. However a more useful extension would be to give it a pml or pmlPart object and simulate sequences from all model parameter. This could be very handy as part of a very easy-to-use parametric bootstrap function. I will try to find some time to add it to the next version of phangorn.


    1. Hi Klaus.

      Yes, this is slow - it wraps around sim.history which simulates single characters up the branches & nodes of the tree.

      Yes, my problem was that I was trying to supply simSeq with a full Q matrix (i.e., including the diagonal) rather than the upper diagonal as a vector. It worked - but it didn't produce the expected result for obvious reasons. (This is a documentation problem - not a problem with simSeq - because you describe Q as the rate matrix, but actually want a vector corresponding to the upper diagonal.)

      I wanted to simulate data where the rate of transitions was 1000 or 10000 × higher than transversions (this is just for the purpose of concept demonstration in a class). I'll have to check Felsenstein to get the indices α through η as you say. Is it the upper triangle by row?

      I'm not sure what's problematic (other than perhaps terminological) about the use of the parameter rate. I did this so that Q could be scaled arbitrarily, but it's not particularly important. You can ignore it and just set Q.

      Thanks Klaus.

      All the best, Liam

    2. Hi Liam,

      so you probably wanted to do something like this (K2P, ts/tv ratio of 1000):
      X2<-simSeq(tree,l=2000,rate=0.1, Q=c(1,1000,1,1,1000,1))
      or this (HKY):
      X2<-simSeq(tree,l=2000,rate=0.1, Q=c(1,1000,1,1,1000,1), bf = c(.3,.2,.2,.3))

      Take care the order of the nucleotides is A,C,G,T in simSeq and A,G,C,T in Felsensteins book. Just to make things a bit more confusing. I am not sure of how the Q matrix/vextor I use should be called correctly.

      The scaling of Q is important as you have to either fix the edge length or the rates. Than edge lengths can be interpreted as expected number of substitution per site and are comparable to seq-gen and similar software.

      I added the rate parameter mainly to model a discrete gamma models, see the example from simSeq:

      > (rates <- phangorn:::discrete.gamma(1,4))
      [1] 0.1369538 0.4767519 1.0000000 2.3862944
      > mean(rates)
      [1] 1
      > data1 <- simSeq(tree, l = 500, type="AA", model="WAG", rate=rates[1])
      > data2 <- simSeq(tree, l = 500, type="AA", model="WAG", rate=rates[2])
      > data3 <- simSeq(tree, l = 500, type="AA", model="WAG", rate=rates[3])
      > data4 <- simSeq(tree, l = 500, type="AA", model="WAG", rate=rates[4])
      > data <- c(data1,data2, data3, data4)

      This may be interesting for your class and I probably should make discrete.gamma public in the NAMESPACE.

      For a small example speed is not important, especially as examples on your blog focus on how functions can be written easily. However I often got complaints of how slow R is, even for example optim.pml is as fast or faster as phyML. Of course most people don not take into account how fast you can write a small function or how easy profiling code is in R. That is the reason I am a bit worried to put code out which is not the fastest, as I do not want to give people reasons to confirm their prejudices. In fact simSeq is mainly faster because I avoid some loops as the sampling is vectorized.

      Have a nice weekend


Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.