Tuesday, March 19, 2024

A high-dimensional continuous trait evolution model with Γ distributed rates among traits

Today I had a very nice meeting with a grad student from one of those universities across town in Cambridge.

She had a type of dataset that I’ve seen before: numeric values for many (i.e., thousands of) traits, but for relatively few taxa – only about 25 species, related by a phylogeny. In her case, the data came from gene expression.

We talked about a few different ideas for what to do with the data (e.g., fitting simple models or measuring phylogenetic signal), but one thing she hadn’t thought of before was analyzing the data in aggregate – in other words, more like a nucleotide sequence alignment than as many individual traits.

For example, I suggested that we might fit a Brownian model in which, rather than being fixed, \(\sigma^2\) was modeled as a random variable drawn from a \(\Gamma\) distribution, much in the same way that we typically model rate variation among nucleotide sites. We would then be free to (for example) compare a model with a single \(\Gamma\) distribution across all genes to one in which \(\sigma^2\) came from a different distribution for (say) metabolic genes compared to (say) maintenance genes, and so on. (OK, it’s probably pretty obvious that I don’t know much about what types of genes there are, but you get the idea!)

I told her that this was probably easier than she thought, so, to prove it, I’ve set out here to build a model for the evolution of an arbitrarily large number of independent, continuous traits (say, gene expression measures for thousands of genes) in which the gene-wise values of the Brownian motion rate \(\sigma^2\) are distributed as a normalized \(\Gamma\) random variable with shape parameter \(\alpha\) (and a scaling term, \(\sigma^2\), which can validly be interpreted as the average rate of evolution across genes).
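To write the model out explicitly (using the notation above), each gene \(i\) evolves by Brownian motion with its own rate \(\sigma_i^2 = \sigma^2 r_i\), in which \(r_i\) is a \(\Gamma\) deviate with shape \(\alpha\) and rate \(\alpha\) (so \(E(r_i) = 1\), and \(\sigma^2\) is the average rate). On a phylogeny this means that the vector of species values for gene \(i\) is multivariate normal:

\[ \mathbf{x}_i \sim \mathrm{MVN}\left(x_{0,i}\,\mathbf{1},\; \sigma^2 r_i \mathbf{C}\right), \qquad r_i \sim \Gamma(\alpha, \alpha), \]

in which \(x_{0,i}\) is the root state of gene \(i\) and \(\mathbf{C}\) is the familiar matrix of shared branch lengths among species.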

First, let’s simulate under this model. Here, I’m going to imagine two different data arrays, each containing the expression values for 2,000 genes. I’m going to create one such array with low heterogeneity in the rate, \(\sigma^2\), of gene expression evolution among genes (e.g., \(\alpha = 20\)), and a second data array with high heterogeneity among genes (e.g., \(\alpha = 0.4\)).

## number of genes in each simulated data array
ngenes<-2000
## array 1: low rate heterogeneity (alpha = 20), average rate 1.5
alpha1<-20
sig2_1<-1.5
## gene-wise rates: mean-one gamma deviates rescaled by the average rate
sigsq1<-rgamma(n=ngenes,shape=alpha1,scale=1/alpha1)*sig2_1
## array 2: high rate heterogeneity (alpha = 0.4), average rate 0.75
alpha2<-0.4
sig2_2<-0.75
sigsq2<-rgamma(n=ngenes,shape=alpha2,scale=1/alpha2)*sig2_2
par(mfrow=c(2,1),mar=c(5.1,4.1,2.1,2.1))
hist(sigsq1,breaks=20,main="",las=1,cex.axis=0.8,cex.lab=0.9,
  xlab=expression(sigma^2))
mtext("a) low rate heterogeneity",adj=0,line=1)
hist(sigsq2,breaks=20,main="",las=1,cex.axis=0.8,cex.lab=0.9,
  xlab=expression(sigma^2))
mtext("b) high rate heterogeneity",adj=0,line=1)

(Figure: histograms of the simulated gene-wise rates, \(\sigma^2\) – a) low rate heterogeneity; b) high rate heterogeneity.)

Be attentive to the difference in the scale of the x axis between the two plots!
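To put some numbers on that difference (the exact values will vary from run to run, of course, since these are random draws), we can just compare the ranges and coefficients of variation of the gene-wise rates in the two arrays.

## spread of the gene-wise rates in each array
range(sigsq1)
range(sigsq2)
## coefficient of variation of the rates in each array
sd(sigsq1)/mean(sigsq1)
sd(sigsq2)/mean(sigsq2)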

Now, let’s simulate. For this I’ll use the phytools functions pbtree and fastBM as follows.

library(phytools)
## simulate a pure-birth tree with 25 tips, rescaled to unit total depth
tree<-pbtree(n=25,scale=1)
## simulate each gene independently under Brownian motion with its own rate
foo<-function(sig2,tree) fastBM(tree=tree,sig2=sig2)
x1<-sapply(sigsq1,foo,tree=tree)
x2<-sapply(sigsq2,foo,tree=tree)

To get a sense of what we’ve created, the following is a pair of visualizations of just the first 100 genes in each of our two data arrays.

phylo.heatmap(tree,x1[,1:100],ftype="off",labels=FALSE,colors=hcl.colors(n=40))

(Figure: phylogenetic heatmap of the first 100 genes from the low rate heterogeneity array, x1.)

phylo.heatmap(tree,x2[,1:100],ftype="off",labels=FALSE,colors=hcl.colors(n=40))

(Figure: phylogenetic heatmap of the first 100 genes from the high rate heterogeneity array, x2.)

The difference should be quite striking! In the first of our two data arrays, every trait shows a broadly similar amount of variation among species. By contrast, in the second array some genes vary a great deal in their expression – while many vary hardly at all!
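If you want to quantify this rather than just eyeball it, one simple check is to compute the among-species variance of each gene in the two arrays and compare their distributions.

## among-species variance of each gene in the two data arrays
v1<-apply(x1,2,var)
v2<-apply(x2,2,var)
summary(v1)
summary(v2)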

OK. Now we’re ready to proceed and fit our model. Here’s my function. I’ve tried to include a few comments to help the reader make sense of what’s going on!

bm_gamma<-function(tree,X,ncat=10,trace=1,...){
  ## get optional arguments
  tol<-if(hasArg(tol)) list(...)$tol else 1e-12
  lower<-if(hasArg(lower)) list(...)$lower else c(tol,0.1)
  upper<-if(hasArg(upper)) list(...)$upper else c(Inf,Inf)
  ## sort the input data
  X<-X[tree$tip.label,]
  X<-as.matrix(X)
  ## compute the among species covariance matrix
  C_full<-vcvPhylo(tree)
  C<-C_full[tree$tip.label,tree$tip.label]
  ## compute the GLS estimate of the root state for each gene
  ## (it will be easier if we subtract these values out later)
  one<-matrix(1,Ntip(tree),1)
  w<-as.vector(t(as.numeric(t(one)%*%solve(C)%*%
      one)^(-1)*one)%*%solve(C))
  x0<-apply(X,2,function(x,w) sum(x*w),w=w)
  ## here's our likelihood function
  ## ncat gives the number of categories of the discretized gamma
  lik_gamma<-function(par,C,X,x0,ncat=10){
    sig2<-par[1]
    alpha<-par[2]
    ## rate for each category: gamma quantiles at the midpoints of
    ## ncat equal-probability bins, rescaled so their mean is sig2
    Rates<-qgamma(seq(1/(2*ncat),1,by=1/ncat),shape=alpha,
      scale=1/alpha)
    Rates<-Rates/mean(Rates)*sig2
    ## center each gene on its estimated root state
    X<-t(apply(X,1,function(x,x0) x-x0,x0=x0))
    logL<-0
    for(i in 1:ncol(X)){
      ## average the density of gene i over the rate categories
      ## (dmnorm is from the mnormt package)
      lik<-0
      for(j in 1:length(Rates)){
        lik<-lik+mnormt::dmnorm(X[,i],varcov=Rates[j]*C)/ncat
      }
      logL<-logL+log(lik)
    }
    -logL
  }
  ## starting values for optimization
  init.sig2<-mean(apply(X,1,
    function(x) mean(x^2)))/max(C)*runif(n=1,0,2)
  init.alpha<-runif(n=1,0.5,2)
  fit<-optim(c(init.sig2,init.alpha),lik_gamma,C=C,X=X,x0=x0,
    ncat=ncat,method="L-BFGS-B",lower=lower,upper=upper,
    control=list(trace=trace))
  ## likelihood function to return with the fitted object (note that,
  ## like lik_gamma, it returns the negative log-likelihood)
  lik<-function(sig2,alpha) lik_gamma(c(sig2,alpha),C=C,
    X=X,x0=x0,ncat=ncat)
  object<-list(sig2=fit$par[1],
    alpha=fit$par[2],
    x0=x0,
    logLik=-fit$value,
    counts=fit$counts,
    convergence=fit$convergence,
    message=fit$message,
    lik=lik)
  class(object)<-"bm_gamma"
  object
}
print.bm_gamma<-function(x,...){
  cat("Fitted object of class \"bm_gamma\".\n\n")
  cat("Maximum likelihood estimates....\n")
  cat(paste("            sigma-squared:",signif(x$sig2,5),"\n"))
  cat(paste("                    alpha:",signif(x$alpha,5),"\n\n"))
  cat(paste("log-likelihood (df = ",length(x$x0)+2,") : ",
    signif(x$logLik,8),"\n\n",sep=""))
  if(x$convergence==0) cat("R thinks optimization has converged.\n") else
    cat("Optimization may not have converged.\n")
}
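Just to spell out what lik_gamma is doing: for each gene it averages the multivariate normal density of the (root-centered) data across the ncat discrete rate categories, and then sums the logs of those averages over genes. In other words, the log-likelihood being maximized is approximately

\[ \log L(\sigma^2, \alpha) \approx \sum_{i=1}^{n} \log\left[\frac{1}{K}\sum_{k=1}^{K} \phi\!\left(\mathbf{x}_i - \hat{x}_{0,i}\,\mathbf{1};\; \mathbf{0},\; \sigma^2 r_k \mathbf{C}\right)\right], \]

in which \(n\) is the number of genes, \(K\) is the number of rate categories (ncat), \(r_k\) is the normalized rate of category \(k\), and \(\phi\) is the multivariate normal density.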

Now let’s try to run it. We can start with our low rate heterogeneity (i.e., high \(\alpha\)) data array, x1.

fit_x1<-bm_gamma(tree,x1,trace=1)
## iter   10 value 44206.298053
## final  value 44205.805732 
## converged

Our generating value of \(\alpha\) for this simulation was \(\alpha = 20\), while our scale factor, or average rate, was \(\sigma^2 = 1.5\). Let’s see how our estimated values compare.

fit_x1
## Fitted object of class "bm_gamma".
## 
## Maximum likelihood estimates....
##             sigma-squared: 1.4563 
##                     alpha: 18.684 
## 
## log-likelihood (df = 2002) : -44205.806
## 
## R thinks optimization has converged.

(The degrees of freedom are 2,002 because of the two parameters, \(\alpha\) and \(\sigma^2\), of the rate heterogeneity model, plus the 2,000 root states. This may seem like a lot of parameters to estimate, but it’s barely more than half of the 2 × 2,000 = 4,000 parameters – a rate and a root state for each gene – that we’d need to fit a separate Brownian model to every trait!)

Now let’s go to our high rate heterogeneity (i.e., low \(\alpha\)) character array, x2.

fit_x2<-bm_gamma(tree,x2,trace=1)
## iter   10 value -8434.179860
## final  value -8434.846987 
## converged

This time, our generating value of \(\alpha\) was \(\alpha = 0.4\), while our scale factor, or average rate, was set to \(\sigma^2 = 0.75\). Once again, let’s see how our estimated values compare.

fit_x2
## Fitted object of class "bm_gamma".
## 
## Maximum likelihood estimates....
##             sigma-squared: 0.60105 
##                     alpha: 0.18695 
## 
## log-likelihood (df = 2002) : 8434.847
## 
## R thinks optimization has converged.

Remarkably, these are pretty close to (or, at minimum, in the general direction of) the generating values in both cases! Not bad.
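If we wanted to get a rough sense of how much information the data contain about \(\alpha\), one quick (purely illustrative) thing we could do is use the lik function stored in the fitted object to profile the log-likelihood over \(\alpha\), holding \(\sigma^2\) at its MLE. (Remember that lik returns the negative log-likelihood, as used by optim, so we flip its sign here; also, with 2,000 genes this takes a little while to run.)

## rough profile of the log-likelihood over alpha for fit_x1,
## holding sigma^2 at its maximum likelihood estimate
aa<-seq(10,30,length.out=21)
profile_logL<-sapply(aa,function(a) -fit_x1$lik(fit_x1$sig2,a))
plot(aa,profile_logL,type="l",las=1,xlab=expression(alpha),
  ylab="log-likelihood")
abline(v=fit_x1$alpha,lty="dotted")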

Now, I’ve already found that there can be some numerical issues in which the log-likelihood evaluates to -Inf which will cause optim to fail, so please consider this to be very much a work in progress. Nonetheless, it’s pretty cool as a starting point.
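One likely culprit is numerical underflow: when \(\alpha\) is small, many of the per-category densities evaluate to zero, so their average is zero and its log is -Inf. A possible fix – which I haven’t tested carefully, so treat this as a sketch rather than a finished solution – would be to do the averaging on the log scale inside lik_gamma, using dmnorm(...,log=TRUE) together with the log-sum-exp trick. Something like the following could stand in for the inner loop over rate categories.

## for gene i, average the per-category densities on the log scale
logf<-sapply(1:length(Rates),function(j)
  mnormt::dmnorm(X[,i],varcov=Rates[j]*C,log=TRUE))
## log of the mean density: log((1/ncat)*sum(exp(logf)))
logL<-logL+max(logf)+log(mean(exp(logf-max(logf))))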

That’s it.
