Friday, September 13, 2013

Simulating DNA sequences with Γ rate heterogeneity among sites

I just added this to the (fairly maligned) function genSeq, for simulating DNA sequences by (inefficiently) wrapping the phytools function sim.history. Here's what the function looks like now:

genSeq<-function(tree,l=1000,Q=NULL,rate=1, format="DNAbin",...){
 if(is.null(Q)){
  Q<-matrix(1,4,4)
  rownames(Q)<-colnames(Q)<-c("a","c","g","t")
  diag(Q)<-0
  diag(Q)<--colSums(Q)
 }
 if(length(rate)!=l){
  if(length(rate)==1) rate<-rep(rate,l)
  else {
   cat("warning: length(rate) & l should match for      length(rate)>1\n")
   cat(" rate will be recycled.\n")
   rate<-rep(rate,ceiling(l/length(rate)))[1:l]
  }
 }
 cat("simulating sequences....\n")
 X<-sapply(rate,function(a,b,c)   sim.history(b,a*c)$states,b=tree,c=Q)
 if(format=="DNAbin") return(as.DNAbin(X))
 else if(format=="phyDat") return(as.phyDat(X))
 else if(format=="matrix") return(X)
}

One can easily see that the way rate heterogeneity is handled is by allowing the user to supply a site-specific vector of rates. Here's what this would look like if we wanted to simulate sequences under a continuous Γ rate heterogeneity model with a shape parameter, α = 0.25 and a mean rate of 1.0 substitution per site per unit of branch length:

> library(phytools)
> tree<-pbtree(n=26,tip.label=LETTERS,scale=0.1)
> gg<-rgamma(n=1000,shape=0.25,rate=0.25)
> X<-genSeq(tree,rate=gg)
simulating sequences....
> X
26 DNA sequences in binary format stored in a matrix.

All sequences of same length: 1000

Labels: A B C D E F ...

Base composition:
    a    c    g    t
0.264 0.245 0.242 0.249

We can do invariant sites the same way - although in this case it will break the function if we set any of the rates to zero, so we need to set them very slightly > 0:

> ii<-rep(1,1000)
> tol<-1e-12
> ii[sample(1:1000,100)]<-tol
> X<-genSeq(tree,rate=ii)
simulating sequences....
> X
26 DNA sequences in binary format stored in a matrix.

All sequences of same length: 1000

Labels: A B C D E F ...

Base composition:
    a    c    g    t
0.244 0.261 0.238 0.257

As Klaus pointed out, this method is very slow - so you probably have better options. But it's still a fun exercise to see how it works - and offers the interesting possibility that we could store and return all character histories along with the states for tips.

No comments:

Post a Comment