Phylogenetic Tools for Comparative Biology: Simple DNA sequence simulator using sim.history internally

Thursday, September 12, 2013

Simple DNA sequence simulator using sim.history internally

I was recently playing with simSeq in the phangorn package - but I couldn't get it to do exactly what I wanted (probably for lack of sufficient patience). Then I realized that I could (nearly) just as easily simulate DNA sequences using phytools with the function sim.history. Here's a quick & incredibly simple function that I wrote to do this that wraps around sim.history:

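## requires phytools (sim.history), ape (as.DNAbin) & phangorn (as.phyDat)
genSeq<-function(tree,l=1000,Q=NULL,rate=1,format="DNAbin",...){
  if(is.null(Q)){
    ## default: Jukes-Cantor, all substitution rates equal
    Q<-matrix(1,4,4)
    rownames(Q)<-colnames(Q)<-c("a","c","g","t")
    diag(Q)<-0
    diag(Q)<- -colSums(Q)
  }
  ## one sim.history call per site; tip states form the columns of X
  X<-replicate(l,sim.history(tree,rate*Q)$states)
  if(format=="DNAbin") return(as.DNAbin(X))
  else if(format=="phyDat") return(as.phyDat(X))
  else if(format=="matrix") return(X)
  else stop("format should be \"DNAbin\", \"phyDat\", or \"matrix\"")
}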

Yes, it's really that simple. Now, admittedly this function cannot simulate rate heterogeneity among sites, unequal base frequencies, or invariant sites, but these would be relatively straightforward to add.
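
As a rough illustration of how rate heterogeneity and invariant sites might be layered on top, here is a minimal sketch in base R. The helper siteRates and its arguments shape and pinv are hypothetical names made up for this example; they are not part of phytools or phangorn.

```r
## Hypothetical helper (not part of phytools or phangorn): draw one relative
## rate per site from a gamma distribution with mean 1, then set a random
## subset of sites to rate 0 so that they behave as invariant sites.
siteRates <- function(l, shape = 0.5, pinv = 0) {
  r <- rgamma(l, shape = shape, rate = shape)  # mean 1, variance 1/shape
  invariant <- runif(l) < pinv                  # TRUE marks an invariant site
  r[invariant] <- 0
  r
}

set.seed(1)
r <- siteRates(2000, shape = 0.5, pinv = 0.2)
mean(r == 0)        # proportion of invariant sites, should be near pinv
mean(r[r > 0])      # mean rate of the variable sites, should be near 1
```

Each r[i] could then multiply rate*Q in the sim.history call for site i, for example by looping over sites instead of using replicate. Sites with rate 0 would probably need separate handling, such as assigning a single randomly drawn base to every tip; how sim.history behaves with an all-zero Q is not checked here.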

Here, let's try it out:

> ## first let's generate a random tree
> ## with a basal root taxon "Z"
> tree<-pbtree(n=25,scale=0.9)
> tree$tip.label<-LETTERS[25:1]
> tree$root.edge<-0
> root<-list(edge=matrix(c(3,1,3,2),2,2,byrow=TRUE),
edge.length=c(1,0.1),
tip.label=c("Z","NA"),
Nnode=1)
> class(root)<-"phylo"
> tree<-paste.tree(root,tree)
>
> ## now let's simulate under Jukes-Cantor
> ## (the default)
> X<-genSeq(tree,l=2000,rate=0.1,format="phyDat")
> X
26 sequences with 2000 character and 1711 different site patterns.
The states are a c g t
>
> ## now let's use our data for inference
> library(phangorn)
> obj<-pml(rtree(n=26,tip.label=LETTERS),X)
> fit<-optim.pml(obj,optNni=TRUE)
optimize edge weights: -56156.91 --> -40367.43
optimize edge weights: -40367.43 --> -40367.43
optimize topology: -40367.43 --> -38967.95
...
optimize topology: -25858.87 --> -25858.87
0
Warning message:
I unrooted the tree (rooted trees are not yet supported)
>
> ## measure RF distance to original
> RF.dist(unroot(tree),unroot(fit$tree))
[1] 0
> ## plot original and estimated tree
> par(mfcol=c(2,1))
> plotTree(tree,mar=c(0.1,1.1,3.1,0.1))
> title("a) Generating tree",adj=0,cex.main=1.2)
> plotTree(midpoint(fit$tree),mar=c(0.1,1.1,3.1,0.1))
> title("b) Estimated tree using ML",adj=0,cex.main=1.2)

That works OK. Adding other attributes typical of molecular sequence models is really easy, so I'll probably do that. Pretty cool.
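
For instance, unequal base frequencies could be handled by building a different Q and passing it to genSeq. The sketch below, in base R only, builds an F81-style matrix in which the rate of change to base j is proportional to its frequency. The helper makeF81 and the frequencies in bf are hypothetical and chosen only for illustration.

```r
## Illustrative base frequencies (made up for this example)
bf <- c(a = 0.1, c = 0.2, g = 0.3, t = 0.4)

## Hypothetical helper: F81-style rate matrix with Q[i,j] = bf[j] for i != j,
## i.e. rows are the starting base and columns the base changed to
makeF81 <- function(bf) {
  stopifnot(length(bf) == 4, abs(sum(bf) - 1) < 1e-8)
  Q <- matrix(bf, 4, 4, byrow = TRUE,
              dimnames = list(names(bf), names(bf)))
  diag(Q) <- 0
  diag(Q) <- -rowSums(Q)   # each row sums to zero
  Q
}

Q <- makeF81(bf)
Q
bf %*% Q   # all (numerically) zero: bf is the stationary distribution of Q
```

Because this Q is not symmetric, it matters whether sim.history reads the rate from i to j as Q[i,j] or Q[j,i]; that should be checked against the documentation of the installed phytools version (transposing with t(Q) switches conventions). The state at the root would presumably also need to be drawn with probabilities bf.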

3 comments:

Klaus, September 13, 2013 at 5:45 AM
Hi Liam,

what was your problem with simSeq? So far simSeq is about 100 times faster and there is still some room for possible improvements.

> system.time(X<-genSeq(tree,l=2000,rate=0.1,format="phyDat"))
   user  system elapsed
 18.628   0.000  18.621
> X
26 sequences with 2000 character and 1688 different site patterns.
The states are a c g t
> system.time(X2<-simSeq(tree,l=2000,rate=0.1))
   user  system elapsed
  0.112   0.044   0.103
> X2
26 sequences with 2000 character and 796 different site patterns.
The states are a c g t

Your use of the rate parameter is problematic here. I scaled the Q matrix using formula (13.14) on page 205 of Felsenstein's "Inferring Phylogenies". Q contains \alpha to \eta in Felsenstein's notation. We can re-estimate the rate parameter with optim.pml and the original tree for the simulated data:

obj2<-pml(tree,X)
fit2<-optim.pml(obj2,optRate=TRUE, optEdge=FALSE)

obj3<-pml(tree,X2)
fit3<-optim.pml(obj3,optRate=TRUE, optEdge=FALSE)

> fit2$rate
[1] 0.2468205
# too high
> fit3$rate
[1] 0.1011709
# about right and seems to converge nicely
# to 0.1 for larger sequences ;)
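
The difference in scale can be illustrated with a few lines of base R. This is only a sketch: it assumes the scaling in question is the usual normalization of Q to one expected substitution per site per unit of branch length (the formula itself is not reproduced here), and the names bf, expRate and Qscaled are made up for the example.

```r
## Default matrix from genSeq in the post: all off-diagonal rates equal to 1
bases <- c("a", "c", "g", "t")
Q <- matrix(1, 4, 4, dimnames = list(bases, bases))
diag(Q) <- 0
diag(Q) <- -rowSums(Q)

bf <- rep(0.25, 4)               # equal base frequencies under Jukes-Cantor
expRate <- -sum(bf * diag(Q))    # expected substitutions per unit time
Qscaled <- Q / expRate           # assumed normalization: expected rate of 1

expRate                          # expected rate of the unscaled matrix
-sum(bf * diag(Qscaled))         # expected rate after rescaling
```

Under that assumption the unscaled default matrix has an expected rate of 3, so rate=0.1 in genSeq would correspond to roughly 0.3 on the normalized scale, which is at least in the same direction as the too-high estimate above.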

simSeq is very rough so far, mainly serving as a workhorse for some bigger simulations. The rate parameter can be used for a (discrete) gamma model. I have always wanted to improve it, e.g. by allowing rate variation parameters directly or adding codon models. However, a more useful extension would be to give it a pml or pmlPart object and simulate sequences from all of the model parameters. This could be very handy as part of a very easy-to-use parametric bootstrap function. I will try to find some time to add it to the next version of phangorn.

Cheers,
Klaus
