Phylogenetic Tools for Comparative Biology: A discrete-character-dependent multi-θ Ornstein-Uhlenbeck model in phytools

I’m still working out the bugs, but I just added a new function, fitmultiOU, to phytools 2.6-1 on GitHub for fitting a discrete-character-dependent multi-regime Ornstein-Uhlenbeck using the discrete diffusion approximation of Boucher & Démery (2016) – that we also describe in great detail in a new bioRxiv pre-print. I encourage anyone interested in this powerful tool to go & check out our pre-print!

fitmultiOU builds off of the post I created a couple of weeks ago on this blog showing how to calculate the likelihood under a simple, single- \(\theta\) OU model on a tree using the discrete approximation. fitmultiOU just takes this a step further in exactly the same way as fitmultiBM and fitmultiTrend.

As previously noted, I’m definitely still working out the kinks. Indeed, I wouldn’t trust it farther than I can throw it. Hopefully that will change soon. Nonetheless, preliminary tests are promising.

To see this, let’s start by simulating a discrete character on a tree of 1000 taxa that assumes two different levels, a and b, as follows. (I’m choosing to simulate on a relatively large tree on purpose just to give us the best chance of recovering our generating parameter values if our method works.)

library(phytools)

## Loading required package: ape

## Loading required package: maps

N<-1000
tree<-pbtree(n=N,scale=10)
tree

## 
## Phylogenetic tree with 1000 tips and 999 internal nodes.
## 
## Tip labels:
##   t358, t660, t787, t788, t659, t210, ...
## 
## Rooted; includes branch length(s).

q<-0.2
Q<-matrix(c(-q,q,q,-q),2,2,
  dimnames=list(letters[1:2],letters[1:2]))
Q

##      a    b
## a -0.2  0.2
## b  0.2 -0.2

sim_tree<-sim.history(tree,Q,anc="a")

## Done simulation(s).

cols<-setNames(hcl.colors(n=2),letters[1:2])
plot(sim_tree,cols,ftype="off",lwd=1,
  direction="upwards")
par(lend=1)
legend("bottomleft",letters[1:2],lwd=2,
  col=hcl.colors(n=2),cex=0.6,bty="n")

plot of chunk unnamed-chunk-7

To be clear, we've generated this trait history for simulation -- but for estimation, we'll use only the tip states & the tree.

Next, let’s choose our different simulation parameters. So far, I’m only allowing \(\theta\), the position of th “optimum” in an OU model, to vary as a function of the discrete trait. Nonetheless, our simulator, multiOU, requires that we supply vectors for \(\alpha\), \(\sigma^2\), and \(\theta\), even if we intend to vary only \(\theta\) by discrete character regime.

I’m going to set \(\alpha = 0.3\), which corresponds to a phylogenetic “half-life” of \(\log(2) / \alpha \approx 2.3\) (our total tree height is 10), \(\sigma^2 = 0.1\), and \(\theta = [-0.5, 1.5]\) for the two different regimes.

alpha<-setNames(rep(0.3,2),letters[1:2])
alpha

##   a   b 
## 0.3 0.3

sig2<-setNames(rep(0.1,2),letters[1:2])
sig2

##   a   b 
## 0.1 0.1

theta<-setNames(c(-0.5,1.5),letters[1:2])
theta

##    a    b 
## -0.5  1.5

Now we’re ready for our simulation.

multiOU allows us to separately specify the root state and the root regime, but because I’m not sure these are separately estimable, I’m going to set them to be equal to one another. (That is, I’ll assume that the initial condition of the simulation for the continuous trait is on the “optimum” as prescribed by the discrete character condition at the root.)

## get the root state
root_state<-getStates(sim_tree,"nodes")[1]
root_state

## 1001 
##  "a"

Now I can simulate my continuous trait, x.

x<-multiOU(sim_tree,alpha,sig2,theta,a0=theta[root_state])
head(x)

##     t358     t660     t787     t788     t659     t210 
## 1.336066 1.620262 1.661155 1.800279 1.728212 1.764376

Our analysis is going to use as input the tip states for our discrete trait, \(y\) – not, importantly, the regimes we generated using sim.history, as mentioned previously.

Since only the object sim_tree has this information, we should pull it off using phytools::getStates.

y<-as.factor(getStates(sim_tree,"tips"))
head(y)

## t358 t660 t787 t788 t659 t210 
##    b    b    b    b    b    b 
## Levels: a b

As a final step, I’m going to try to identify reasonable starting values for our optimization process. This will just make finding the MLEs a bit more efficient.

init<-setNames(
  c(
    mean(x[y==levels(y)[1]]),mean(x[y==levels(y)[2]]),
    log(2)/max(nodeHeights(tree)),
    var(x)/max(nodeHeights(tree)),
    fitMk(tree,y)$rates),
  c("theta[a]","theta[b]","alpha","sigsq","q[1]"))
init

##    theta[a]    theta[b]       alpha       sigsq        q[1] 
## -0.01785477  0.80715486  0.06931472  0.05112673  0.23197867

Now let’s try to optimize our model using fitmultiOU. I’ll turn on trace to get periodic progress reports about our optimization process. As a word of warning, this takes a while. (The following analysis took several hours to run on my laptop.)

fit_mou<-fitmultiOU(tree,x,y,model="ER",levs=100,
  parallel=TRUE,root="mle",init=init,trace=1)

## iter	the[a]	the[b]	alpha	sigsq	q[1]	log(L)
## 0	-0.0179	0.8072	0.0693	0.0511	0.2320	-932.8691 
## 100	-0.8977	1.1648	0.3001	0.0832	0.2364	-724.9863 
## 200	-0.6454	1.2625	0.3260	0.0865	0.2239	-724.2291 
## 300	-0.4044	1.3876	0.3480	0.0904	0.2229	-723.3257 
## 400	-0.4038	1.4186	0.3340	0.0856	0.2210	-722.8809 
## 500	-0.3719	1.4540	0.3327	0.0834	0.2263	-722.6922 
## 600	-0.3316	1.6840	0.3190	0.0774	0.2268	-721.7071 
## 700	-0.3316	1.6656	0.3145	0.0814	0.2383	-720.9715 
## 800	-0.3259	1.6652	0.3145	0.0817	0.2379	-720.9686 
## 874	-0.3266	1.6667	0.3141	0.0816	0.2378	-720.9684 
## Done optimizing.

Here's our fitted model.

fit_mou

## Object of class "fitmultiOU" based on
##     a discretization with k = 100 levels.
## 
## Fitted multi-theta OU model parameters:
##  levels: [ a, b ]
##   theta: [ -0.3266, 1.6667 ]
##   alpha: 0.3141 
##   sigsq: 0.0816 
## 
## Estimated Q matrix:
##           a         b
## a -0.237812  0.237812
## b  0.237812 -0.237812
## 
## Log-likelihood: -720.9684 
## 
## R thinks it has found the ML solution.

Well, that’s pretty uncanny as a first try. Again, fitmultiOU is a prototype and is shared only with a warning to use at one’s own risk. Nonetheless, I think it’s right, so that’s pretty cool.

For when I tweet about this, here’s a visualization of multi-regime OU (in general, not this specific case). The code for this visualization comes from a different recent post to this blog.

phy<-pbtree(n=40,scale=10)
tt<-map.to.singleton(make.era.map(phy,
  limits=seq(0,10,length.out=1001)))
Q<-matrix(c(-0.2,0.2,0.2,-0.2),2,2,
  dimnames=list(letters[1:2],letters[1:2]))
s.tt<-sim.history(tt,Q,anc="a")

## Done simulation(s).

cols<-setNames(hcl.colors(n=2),letters[1:2])
theta<-setNames(c(-0.5,1.5),letters[1:2])
sigsq<-setNames(rep(0.1,2),letters[1:2])
alpha<-setNames(rep(0.3,2),letters[1:2])
xx<-multiOU(s.tt,alpha=alpha,sig2=rep(sigsq,2),
  theta=theta,a0=theta["a"],internal=TRUE)
layout(matrix(c(1,2),2,1),heights=c(0.4,0.6))
plot(s.tt,cols,ftype="off",mar=c(1.1,4.1,2.1,2.1),
  xlim=c(0,11))
mtext("a)",adj=0,line=0)
par(mar=c(5.1,4.1,2.1,2.1))
phenogram(s.tt,xx,ftype="off",
  colors=make.transparent(cols,0.5),
  xlim=c(0,11),cex.axis=0.8,las=1)
mtext("b)",adj=0,line=1.5)
for(i in 1:length(theta)){
  stat_theta<-dnorm(seq(min(xx),max(xx),length.out=200),
    mean=theta[i],sd=sqrt(sigsq[i]/(2*alpha[i])))
  stat_theta<-c(0,stat_theta,0)
  stat.norm<-stat_theta/max(stat_theta)*1
  polygon(x=stat.norm+10,
    y=c(min(xx),seq(min(xx),max(xx),
    length.out=200),max(xx)),border=FALSE,
    col=make.transparent(cols[i],0.5))
}

plot of chunk unnamed-chunk-19

Phylogenetic Tools for Comparative Biology

Pages

Saturday, June 13, 2026

A discrete-character-dependent multi-θ Ornstein-Uhlenbeck model in phytools

No comments:

Post a Comment