Friday, July 20, 2012

Newer nonstatic version of phytools, with a bug fix and a new function to rescale SIMMAP style trees

I just posted a newer nonstatic version of 'phytools' (v 0.1-85) with a bug fix to the function export.as.xml (which creates an input file for the stand-alone program SIMMAP v1.5), as well as new function to rescale SIMMAP style modified "phylo" objects in R, sensibly called rescaleSimmap. A direct link to the package build is here

First, the bug fix. Yesterday, I realized that the traditional semi-colon had been cropped from the end of the Newick string in XML format. This makes sense, to some extent, because the end of the string is indicated by a </tree> tag anyway. Pulling the semicolon off the end of the tree string created by ape::write.tree is not quite as simple as it sounds. That is because in R a character string is stored as a single object, rather than a vector of individual characters. To do it, I used strsplit to split up the string, and then paste to put it back together (although in hindsight, I might have used toString instead). This was done basically as follows:

temp<-write.tree(tree) # write Newick to temp
temp<-unlist(strsplit(temp,NULL)) # split into character vector
# paste together, dropping the last character (a ";")
temp<-paste(temp[1:(length(temp)-1)],collapse="")
# write to file with tags
write(paste("\t\t",temp,"",sep=""),file,append=TRUE)


I also added a new function, rescaleSimmap (code here), which rescales a SIMMAP style "phylo" object in R. This function is analagous to rescaleTree in the 'geiger' package. This function takes the input tree and a new total depth (height) for the tree, and then rescales the branch lengths and the mapping elements ($maps and $mapped.edge) proportionally. I programmed this today because I discovered that (at least by default), for a given input tree and character value, SIMMAP samples not only mutational histories, but also total tree depth. Not sure why this is - I may investigate.

Thursday, July 19, 2012

Bug in export.as.xml

In trying to use export.as.xml to export data and trees to read into SIMMAP I have discovered a small bug. Evidently (flaunting years of convention) the Newick-style string containing the tree structure in XML format does not contain a semi-colon character. Who new? I will fix this tomorrow.

New nonstatic version of phytools (v 0.1-84)

I just built (for myself) and posted (for any interested 'phytools' users) a new non-static version of phytools (0.1-84). A direct link to the build is here. To install, just download and then type:

> install.packages("phytools_0.1-84.tar.gz",type="source", repos=NULL)

The package has two new functions over the previous non-static version: export.as.xml (described here) and pgls.Ives (described here), as well as probably a few more new function relative to the latest CRAN release of phytools (v 0.1-8).

Tuesday, July 17, 2012

PGLS regression with sampling error

Following some discussion of the topic on the R-sig-phylo special interest group email listserve, I decided to try and program the PGLS regression method for sampling error in X & Y described originally in a paper by Tony Ives, Peter Midford, and Ted Garland in Systematic Biology (Ives et al. 2007). Even though Ives et al. worked out all the hard parts, this was not a completely trivial undertaking as some of the details were a little difficult to work through. In addition, I first tried to program this as custom correlation structure to use with nlme::gls. It's wholly possible that one might be able to do this in theory, but I was not able to (although I did learn a bit about how these correlation structures are written).

In the end, I did this the old fashioned way, and it seems to work (at least, it converges on standard PGLS when sampling error is zero; and it recovers estimates close to the generating parameter values when sampling error is included for in simulation and accounted for during estimation). Unfortunately, I have so far only programmed bivariate regression. It is theoretically straightforward to extend to multivariable regression, with the caveat that optimizing the likelihood function requires inversion of a covariance matrix whose dimensions scale with N × (p+1) for p independent variables. The optimization is also multidimensional because we simultaneously need to maximize the likelihood the for the rates of evolution in x and yx2, σy2), the regression slope (b1) and the states at the root for x and y (a). We don't need to separately estimate b0 - we can just get it from the root values and the slope.

Code for the function is here. Let's try it:

> # require dependencies and load source
> require(phytools)
> source("pgls.Ives.R")
> # simulate a phylogeny
> tree<-pbtree(n=200,scale=1)
> # simulate under some arbitrary regression model
> x<-fastBM(tree)
> y<-5+3.5*x+fastBM(tree,0.2)
> # create vectors of zeroes for the "no error" case
> Vx0<-Vy0<-Cxy0<-rep(0,length(tree$tip))
> names(Vx0)<-names(Vy0)<-names(Cxy0)<-tree$tip.label
> # fit the no error model with simulated data
> res1<-pgls.Ives(tree,x,y,Vx0,Vy0,Cxy0)
> res1
$beta
[1] 4.874868 3.392148
$sig2x
[1] 0.9823115
$sig2y
[1] 1.008072
$a
[1] -0.2179666 4.1354924
$logL
[1] -224.2079
> # for reference, we can also fit this with gls()
> res2<-gls(y~x,data.frame(x,y),correlation=corBrownian(1,tree))
> res2
Generalized least squares fit by REML
  Model: y ~ x
  Data: data.frame(x, y)
  Log-restricted-likelihood: -115.1862
Coefficients:
(Intercept)           x
   4.874825    3.392136
Correlation Structure: corBrownian
Formula: ~1
Parameter estimate(s):
numeric(0)
Degrees of freedom: 200 total; 198 residual
Residual standard error: 1.00909
> # simulate sampling variances and covariances
> Vx<-rexp(length(tree$tip))/2
> Vy<-rexp(length(tree$tip))/2
> Cxy=sqrt(Vx*Vy)*runif(length(tree$tip),min=-1,max=1)
> names(Vx)<-names(Vy)<-names(Cxy)<-names(x)
> # simulate data with sampling error
> xe<-ye<-vector()
> for(i in 1:length(tree$tip)){
  temp<-mvrnorm(mu=c(x[i],y[i]),
Sigma=matrix(c(Vx[i],Cxy[i],Cxy[i],Vy[i]),2,2))
  xe[i]<-temp[1]; ye[i]<-temp[2]
  }
> names(xe)<-names(x)
> names(ye)<-names(y)
> # fit no error model to data with error
> res3<-pgls.Ives(tree,xe,ye,Vx0,Vy0,Cxy0)
> res3
$beta
[1] 4.1621231 0.2000734
$sig2x
[1] 27.1866
$sig2y
[1] 24.35597
$a
[1] -0.2924528  4.1036111
$logL
[1] -874.7336
> # fit error model
> res4<-pgls.Ives(tree,xe,ye,Vx,Vy,Cxy)
> res4
$beta
[1] 5.133185 3.485432
$sig2x
[1] 0.9533415
$sig2y
[1] 0.830365
$a
[1] -0.2760699  4.1709624
$logL
[1] -568.1378


Obviously, in this case our estimate - particularly of the regression slope - is quite bad when sampling error is ignored; but we do much better with the method of Ives et al. Cool!