I was recently playing with simSeq in the phangorn package - but I couldn't get it to do exactly what I wanted (probably for lack of sufficient patience). Then I realized that I could (nearly) just as easily simulate DNA sequences using phytools with the function sim.history. Here's a quick & incredibly simple function that I wrote to do this that wraps around sim.history:
else if(format=="phyDat") return(as.phyDat(X))
else if(format=="matrix") return(X)
Yes, it's really that simple. Now, admittedly this function cannot simulate rate heterogeneity among sites, unequal base frequencies, or invariant sites, but these would be relatively straightforward to add.
Here's, let's try it out:
> ## with a basal root taxon "Z"
> ## now let's simulate under Juke-Cantor
> ## (the default)
26 sequences with 2000 character and 1711 different site patterns.
The states are a c g t
> ## now let's use our data for inference
optimize edge weights: -56156.91 --> -40367.43
optimize edge weights: -40367.43 --> -40367.43
optimize topology: -40367.43 --> -38967.95
optimize topology: -25858.87 --> -25858.87
I unrooted the tree (rooted trees are not yet supported)
> ## measure RF distance to original
> ## plot original and estimated tree
> title("a) Generating tree",adj=0,cex.main=1.2)
> title("b) Estimated tree using ML",adj=0,cex.main=1.2)
That works OK. Adding other attributes typical of molecular sequence models is really easy, so I'll probably do that. Pretty cool.