Phylogenetic Tools for Comparative Biology: Simulating DNA sequences with Γ rate heterogeneity among sites

Friday, September 13, 2013

Simulating DNA sequences with Γ rate heterogeneity among sites

I just added this to the (fairly maligned) function genSeq, for simulating DNA sequences by (inefficiently) wrapping the phytools function sim.history. Here's what the function looks like now:

genSeq<-function(tree,l=1000,Q=NULL,rate=1, format="DNAbin",...){
if(is.null(Q)){
  Q<-matrix(1,4,4)
  rownames(Q)<-colnames(Q)<-c("a","c","g","t")
  diag(Q)<-0
  diag(Q)<--colSums(Q)
}
if(length(rate)!=l){
  if(length(rate)==1) rate<-rep(rate,l)
  else {
   cat("warning: length(rate) & l should match for      length(rate)>1\n")
   cat(" rate will be recycled.\n")
   rate<-rep(rate,ceiling(l/length(rate)))[1:l]
  }
}
cat("simulating sequences....\n")
X<-sapply(rate,function(a,b,c)   sim.history(b,a*c)$states,b=tree,c=Q)
if(format=="DNAbin") return(as.DNAbin(X))
else if(format=="phyDat") return(as.phyDat(X))
else if(format=="matrix") return(X)
}

One can easily see that the way rate heterogeneity is handled is by allowing the user to supply a site-specific vector of rates. Here's what this would look like if we wanted to simulate sequences under a continuous Γ rate heterogeneity model with a shape parameter, α = 0.25 and a mean rate of 1.0 substitution per site per unit of branch length:

> library(phytools)
> tree<-pbtree(n=26,tip.label=LETTERS,scale=0.1)
> gg<-rgamma(n=1000,shape=0.25,rate=0.25)
> X<-genSeq(tree,rate=gg)
simulating sequences....
> X
26 DNA sequences in binary format stored in a matrix.

All sequences of same length: 1000

Labels: A B C D E F ...

Base composition:
a c g t
0.264 0.245 0.242 0.249

We can do invariant sites the same way - although in this case it will break the function if we set any of the rates to zero, so we need to set them very slightly > 0:

> ii<-rep(1,1000)
> tol<-1e-12
> ii[sample(1:1000,100)]<-tol
> X<-genSeq(tree,rate=ii)
simulating sequences....
> X
26 DNA sequences in binary format stored in a matrix.

All sequences of same length: 1000

Labels: A B C D E F ...

Base composition:
a c g t
0.244 0.261 0.238 0.257

As Klaus pointed out, this method is very slow - so you probably have better options. But it's still a fun exercise to see how it works - and offers the interesting possibility that we could store and return all character histories along with the states for tips.

Phylogenetic Tools for Comparative Biology

Pages

Friday, September 13, 2013

Simulating DNA sequences with Γ rate heterogeneity among sites

No comments:

Post a Comment