Comments on Phylogenetic Tools for Comparative Biology: Simple DNA sequence simulator using sim.history internally
Blog author: Liam Revell (http://www.blogger.com/profile/04314686830842384151)
Feed updated: 2020-06-03T16:24:47.488-04:00

Comment posted 2013-09-13T13:28:59.355-04:00:
Hi Liam,<br /><br />so you probably wanted to do something like this (K2P, ts/tv ratio of 1000): <br />X2<-simSeq(tree,l=2000,rate=0.1, Q=c(1,1000,1,1,1000,1))<br />or this (HKY):<br />X2<-simSeq(tree,l=2000,rate=0.1, Q=c(1,1000,1,1,1000,1), bf = c(.3,.2,.2,.3))<br /><br />Take care: the order of the nucleotides is A,C,G,T in simSeq and A,G,C,T in Felsenstein's book, just to make things a bit more confusing. I am not sure what the correct name is for the Q matrix/vector I use. <br /><br />The scaling of Q is important, as you have to fix either the edge lengths or the rates. Then edge lengths can be interpreted as the expected number of substitutions per site and are comparable to seq-gen and similar software. <br /><br />I added the rate parameter mainly to model a discrete gamma model; see the example from simSeq:<br /><br />> (rates <- phangorn:::discrete.gamma(1,4))<br />[1] 0.1369538 0.4767519 1.0000000 2.3862944<br />> mean(rates) <br />[1] 1 <br />> data1 <- simSeq(tree, l = 500, type="AA", model="WAG", rate=rates[1])<br />> data2 <- simSeq(tree, l = 500, type="AA", model="WAG", rate=rates[2])<br />> data3 <- simSeq(tree, l = 500, type="AA", model="WAG", rate=rates[3])<br />> data4 <- simSeq(tree, l = 500, type="AA", model="WAG", rate=rates[4])<br />> data <- c(data1,data2, data3, data4)<br /><br />This may be interesting for your class, and I probably should make discrete.gamma public in the NAMESPACE.<br /><br />For a small example speed is not important, especially as the examples on your blog focus on how easily functions can be written. However, I often get complaints about how slow R is, even though, for example, optim.pml is as fast as or faster than PhyML. Of course most people do not take into account how fast you can write a small function or how easy profiling code is in R. That is the reason I am a bit worried about putting out code which is not the fastest, as I do not want to give people reasons to confirm their prejudices. 
In fact, simSeq is faster mainly because the sampling is vectorized, which avoids some loops. <br /><br />Have a nice weekend <br />Klaus<br /><br />Posted by Klaus (https://www.blogger.com/profile/11021989593338482289)

Comment posted 2013-09-13T07:20:44.585-04:00:
Hi Klaus.<br /><br />Yes, this is slow - it wraps around sim.history, which simulates single characters up the branches & nodes of the tree.<br /><br />Yes, my problem was that I was trying to supply simSeq with a full Q matrix (i.e., including the diagonal) rather than the upper triangle as a vector. It worked - but it didn't produce the expected result, for obvious reasons. (This is a documentation problem - not a problem with simSeq - because you describe Q as the rate matrix, but actually want a vector corresponding to the upper triangle.)<br /><br />I wanted to simulate data where the rate of transitions was 1000 or 10000 × higher than that of transversions (this is just for the purpose of concept demonstration in a class). I'll have to check Felsenstein to get the indices α through η as you say. Is it the upper triangle by row?<br /><br />I'm not sure what's problematic (other than perhaps the terminology) about the use of the parameter rate. I did this so that Q could be scaled arbitrarily, but it's not particularly important. You can ignore it and just set Q.<br /><br />Thanks Klaus.<br /><br />All the best, Liam<br /><br />Posted by Liam Revell (https://www.blogger.com/profile/04314686830842384151)

Comment posted 2013-09-13T05:45:51.608-04:00:
Hi Liam, <br /><br />What was your problem with simSeq? So far simSeq is about 100 times faster, and there is still some room for possible improvements.<br /><br />> system.time(X<-genSeq(tree,l=2000,rate=0.1,format="phyDat"))<br /> user system elapsed <br /> 18.628 0.000 18.621 <br />> X<br />26 sequences with 2000 character and 1688 different site patterns.<br />The states are a c g t <br />> system.time(X2<-simSeq(tree,l=2000,rate=0.1))<br /> user system elapsed <br /> 0.112 0.044 0.103 <br />> X2<br />26 sequences with 2000 character and 796 different site patterns.<br />The states are a c g t <br /><br />Your use of the rate parameter is problematic here. I scaled the Q matrix using formula (13.14) on page 205 of Felsenstein's "Inferring phylogenies". Q contains α to η in Felsenstein's notation. We can re-estimate the rate parameter with optim.pml and the original tree for the simulated data: <br /><br />obj2<-pml(tree,X)<br />fit2<-optim.pml(obj2,optRate=TRUE, optEdge=FALSE)<br /><br />obj3<-pml(tree,X2)<br />fit3<-optim.pml(obj3,optRate=TRUE, optEdge=FALSE)<br /><br />> fit2$rate <br />[1] 0.2468205 <br /># too high <br />> fit3$rate <br />[1] 0.1011709 <br /># about right and seems to converge nicely<br /># to 0.1 for larger sequences ;)<br /><br />simSeq is very rough so far; it mainly serves as a workhorse for some bigger simulations. The rate parameter can be used for a (discrete) gamma model. I always wanted to improve it, e.g. by allowing rate variation parameters directly or codon models. However, a more useful extension would be to give it a pml or pmlPart object and simulate sequences using all of the model parameters. This could be very handy as part of a very easy-to-use parametric bootstrap function. I will try to find some time to add it to the next version of phangorn. <br /><br />Cheers, <br />Klaus<br /><br />Posted by Klaus (https://www.blogger.com/profile/11021989593338482289)
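
The point in the thread about passing Q as a six-element vector and scaling it can be illustrated in base R, without phangorn. The following is only a minimal sketch under stated assumptions: `make_rate_matrix` is a hypothetical helper (not a phangorn function), the six values are assumed to be ordered AC, AG, AT, CG, CT, GT for the state order A,C,G,T mentioned above, and the scaling shown is the common "one expected substitution per unit of branch length" normalization, which is assumed here to correspond to the scaling discussed in the comments.

```r
# Hypothetical helper, not part of phangorn: expand six exchangeabilities
# (assumed order: AC, AG, AT, CG, CT, GT) and base frequencies into a
# 4 x 4 rate matrix scaled to one expected substitution per unit time.
make_rate_matrix <- function(q, bf = rep(0.25, 4)) {
  stopifnot(length(q) == 6, length(bf) == 4)
  bf <- bf / sum(bf)                      # make sure frequencies sum to 1
  states <- c("a", "c", "g", "t")
  R <- matrix(0, 4, 4, dimnames = list(states, states))
  R[lower.tri(R)] <- q                    # column-wise fill: AC, AG, AT, CG, CT, GT
  R <- R + t(R)                           # symmetric exchangeabilities
  Q <- sweep(R, 2, bf, "*")               # Q[i, j] = R[i, j] * bf[j] for i != j
  diag(Q) <- -rowSums(Q)                  # each row must sum to zero
  mu <- -sum(bf * diag(Q))                # expected substitution rate at equilibrium
  Q / mu                                  # rescale so that this rate equals 1
}

# K2P-like and HKY-like settings taken from the comments above
Q_k2p <- make_rate_matrix(c(1, 1000, 1, 1, 1000, 1))
Q_hky <- make_rate_matrix(c(1, 1000, 1, 1, 1000, 1), bf = c(.3, .2, .2, .3))
print(round(Q_hky, 4))
```

With a matrix scaled this way, multiplying branch lengths by a separate `rate` value changes the expected number of substitutions per site in proportion, which is why an arbitrarily scaled Q and a rate parameter are hard to interpret together.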