Friday, February 9, 2024

Bounded Brownian motion & the most important phylogenetic comparative methods paper you've probably never read

I’m going to make a fairly bold assertion.

That assertion is as follows. The most important paper in phylogenetic comparative methods you've never heard of (at least from the past decade or so) is Boucher & Démery's (2016) article on bounded Brownian motion. You might be excused for not knowing about this paper – in the nearly 8 years since its publication, it has been cited only 32 times.

The paper adds an interesting model to the toolkit of phylogenetic comparative biologists: Brownian motion with reflective bounds. Bounded Brownian motion is an evolutionary scenario that's easy to conceptualize and trivial to simulate, but that was (prior to Boucher & Démery 2016) effectively impossible to fit to observed data. That, on its own, makes this paper important.

In the abstract of the article, however, the authors mention that they “provide analytical expressions for the likelihood of BBM [bounded Brownian motion] and present a method to calculate the likelihood numerically…” (emphasis mine). To be frank, I didn’t get this at first – but when I finally did, my head just about exploded.

To put it in the simple terms that I myself understand, Boucher & Démery simply observe that, in the limit, the probability (density) of a set of data evolving via Brownian motion and via an ordered, symmetric, continuous-time Markov chain (Mk) model converge as the step size between adjacent discrete states goes to zero – and that the two get quite close, so long as the step size is small relative to the range of our observed data.

As someone who is used to teaching about the fundamental underlying differences between the ways in which we model discrete & continuous traits in phylogenetic comparative biology, I was surprised by this at first, and it triggered a liberal measure of disbelief. Upon further reflection, however, at its core this claim is no different from the assertion that Brownian motion and a discrete-time random walk converge as the time interval shrinks to zero! (Something that is rather easy to accept, in fact.) The only difference is that here we've discretized our character state space, rather than time.
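(If you'd like to see that time-discretization analogy in action, here's a trivial, purely illustrative sketch – all of the numbers and variable names below are arbitrary choices of mine – comparing the variance of a fine-grained ± random walk to the Brownian expectation of \(\sigma^2 t\).)

sig2<-1
total_time<-10
nsteps<-10000
dt<-total_time/nsteps
## 1,000 replicate +/- random walks whose small steps have variance sig2*dt
walks<-replicate(1000,sum(sample(c(-1,1),nsteps,replace=TRUE)*
  sqrt(sig2*dt)))
var(walks) ## should be close to the Brownian expectation sig2*total_time = 10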

Diving into the weeds only a tiny bit, we can also say that in the limit, the relationship between the transition rate between adjacent states, \(q\), and the Brownian evolutionary rate, \(\sigma^2\), is given by the following expression:

$$ \sigma^2 = 2q \left( \frac{w}{k} \right)^2 $$

In which \(w\) is the width of the interval between the lower and upper limits of our trait (with those limits either set well outside the observed range of the trait, if we're approximating unbounded Brownian motion, or at the trait bounds, if we're modeling bounded evolution), and \(k\) is the number of discrete trait categories that we've used to approximate the likelihood. In general (for a given value of \(w\)), as \(k\) increases we should approach the true likelihood – though not always precisely monotonically!
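To make that relationship a bit more concrete, here's a minimal sketch (emphatically not the phytools internals – the variable names sigsq, w, k, & Q below are just my own) of how one might assemble the ordered, symmetric Mk rate matrix that approximates Brownian motion with rate \(\sigma^2\) on an interval of width \(w\) discretized into \(k\) bins.

sigsq<-0.1 ## Brownian rate
w<-20 ## width between the lower & upper bounds
k<-200 ## number of discretized trait bins
q<-sigsq/(2*(w/k)^2) ## implied rate of transition between adjacent bins
Q<-matrix(0,k,k)
for(i in 1:(k-1)) Q[i,i+1]<-Q[i+1,i]<-q ## only adjacent states can exchange
diag(Q)<- -rowSums(Q) ## rows sum to zero, as in any continuous-time Markov chain

Under this Q, each discretized state can only step to its immediate neighbors, at rate q – which is exactly the ordered, symmetric Mk process described above.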

Astute readers might wonder how this could be true – as when we fit a continuous trait model, we’re maximizing a likelihood defined as the probability density of our observed data (say, \(\mathbf{x}\)) given a set of model parameters (say, \(\sigma^2\) and \(x_0\), in the case of Brownian motion); while, on the other hand, when we fit a discrete trait model, we’ll be maximizing a likelihood defined as the probability (not probability density) of our data, given our model. Indeed, this confused me as well! The answer can be seen when we appreciate that the probability of observing a continuous random variable on an interval can be obtained from a density function by integrating over that interval; and that, in the limit as the interval width approaches zero, the integral of a function is merely its width times the function height. This means that the converse must also be true: that the density can be approximated by taking the probability of a bin & dividing by the bin width.

Using the same terms as above, our log-likelihood is thus approximated as:

$$ \log(L) \approx \log(L') - N\log\left(\frac{w}{k}\right) $$

In which \(\log(L)\) is our approximation of the log-likelihood density (on our data’s original scale), \(\log(L')\) is the log-likelihood of our discretized continuous data, \(N\) is the number of tips in our tree, and both \(w\) and \(k\) are defined as before. Once again, after transformation, & in the limit as our bin width shrinks (or, equivalently, as \(k \rightarrow \infty\)), this log-likelihood is directly comparable (in other words, the approximation becomes exact) to a log-likelihood density obtained on the original, undiscretized trait data.
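(Here's a quick, phylogeny-free sanity check of that probability-to-density conversion – the particular trait value and bin width are arbitrary – using a standard normal distribution.)

x0<-1.3 ## an arbitrary trait value
width<-0.01 ## a narrow bin centered on x0
## log of the bin's probability, minus log of the bin width...
log(pnorm(x0+width/2)-pnorm(x0-width/2))-log(width)
## ...closely approximates the exact log density at x0
dnorm(x0,log=TRUE)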

So, what’s the big deal? Is it merely that we can now add the single additional model of bounded Brownian motion to our hypothesis arsenal? No! The implication is that we can fit & compare any model in which the back & forth transitions between adjacent discretized character states can be expressed as an ordered, continuous-time Markov chain. This was appreciated by a second very important article that you probably haven’t read: Boucher et al. (2018). (To be fair, this one’s been cited 40 times rather than a mere 32 – so perhaps more people have read it.) Excellent as this article is, I think even it understates the significance & power of this approach. Expect to hear more about this from me later!

Boucher has implemented bounded Brownian motion and the other models from his collaborative 2018 article in R software here – but (in case I haven't made the point sufficiently clear yet) I think this is a very important model for phylogenetic comparative methods, so I've also recently added bounded BM (along with some other related functionality, such as ancestral state estimation – more on this later as well) to phytools. (Please check out my code here. It's remarkably simple!)

I ran into a number of interesting computational challenges during implementation. For example, optimizing the model requires that we calculate the likelihood many times, which in turn requires very large matrix exponentiation – once for each edge of the tree, for each time that we need to compute the likelihood! That’s a lot of large matrix exponentiation. I was able to speed this up by realizing that I could first compute the set of large matrix exponentials for all edges of the tree in parallel using the foreach package, rather than in serial during pruning itself. My laptop has 16 cores, so this makes quite a difference!
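A minimal sketch of that trick (not the actual phytools code – here tree is a placeholder for any "phylo" object, Q is the illustrative rate matrix from the sketch above, and I'm assuming the doParallel & expm packages) might look something like the following.

library(foreach)
library(doParallel)
registerDoParallel(cores=2) ## or however many cores you have available
## pre-compute one transition probability matrix per edge of the tree, in
## parallel, rather than serially during the pruning algorithm itself
P<-foreach(el=tree$edge.length,.packages="expm")%dopar%expm::expm(Q*el)
stopImplicitCluster()

Each element of P is then the transition probability matrix for the corresponding edge, ready to be plugged in during pruning.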

To demonstrate this new phytools functionality (called bounded_bm), let’s first ‘prove’ its equivalency to geiger::fitContinuous for unbounded BM (or, rather, bounds much broader than the observed range of the trait). Then we can proceed to demonstrate that we’re able to correctly measure bounded evolution, when this is the evolutionary process that genuinely underlies our trait data. Finally, I’ll apply it to an empirical case of maximum length evolution in eels.

library(phytools)

(This version of phytools must be obtained from GitHub.)

packageVersion("phytools")
## [1] '2.1.11'

We can start off by simulating unbounded Brownian motion using phytools::fastBM.

For no reason at all, I'll use a genuine empirical tree: the phytools data object "mammal.tree", which comes originally from Garland et al. (1992).

data(mammal.tree)
x<-fastBM(mammal.tree,sig2=0.1,a=1.5)
head(x)
##  U._maritimus     U._arctos U._americanus     N._narica      P._lotor 
##     0.5136932     0.5178826    -1.2364921     2.9156133     4.1869694 
##   M._mephitis 
##     3.5433668

Now, let’s fit our unbounded model with phytools::bounded_bm.

If we don't specify a particular value for our bounds, the function automatically sets them to 3 \(\times\) the range of the observed data – making this (essentially) an unbounded evolutionary scenario. (This optimization can take a while, with most of the computing power consumed by \(500 \times 500\) matrix exponentiation. Boucher & Démery suggest that using a smaller value of levs – the number of discrete categories of the discretized trait, \(k\) above – such as levs = 200, will normally be sufficient. I'll follow this recommendation in the subsequent optimizations in this post.)

mammal_unbounded<-bounded_bm(mammal.tree,x,lik.func="parallel",
  root="mle",levs=500)
mammal_unbounded
## Object of class "bounded_bm" based on
##  	a discretization with k = 500 levels.
## 
## Unwrapped (i.e., bounded) model
## 
## Set or estimated bounds: [ -10.346509 , 13.065879 ]
## 
## Fitted model parameters:
## 	 sigsq: 0.080119 
##      x0: 1.055324 
## 
## Log-likelihood: -75.734562 
## 
## R thinks it has found the ML solution.

Now let’s compare this result to geiger::fitContinuous. Be forewarned. If you still feel skeptical about the equivalency (in the limit) of continuous-time Brownian motion and our symmetric, ordered Mk process on the discretized state space, your mind may be about to be blown….

geiger::fitContinuous(mammal.tree,x)
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
## 	sigsq = 0.080204
## 	z0 = 1.063045
## 
##  model summary:
## 	log-likelihood = -75.764287
## 	AIC = 155.528573
## 	AICc = 155.789443
## 	free parameters = 2
## 
## Convergence diagnostics:
## 	optimization iterations = 100
## 	failed iterations = 0
## 	number of iterations with same best fit = 100
## 	frequency of best fit = 1.000
## 
##  object summary:
## 	'lik' -- likelihood function
## 	'bnd' -- bounds for likelihood search
## 	'res' -- optimization iteration summary
## 	'opt' -- maximum likelihood parameter estimates

(Seriously. What?) If you don't believe it, or suspect that the similarity of our parameter estimates and log-likelihoods could be merely accidental, by all means try different generating parameter values, or increase or decrease the number of state bins, levs!
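For instance – purely as a sketch, with arbitrary generating values of my own choosing – one might try something like the following and confirm that the two fits still match closely.

x2<-fastBM(mammal.tree,sig2=0.5,a=0) ## new data, different generating values
bounded_bm(mammal.tree,x2,lik.func="parallel",root="mle",levs=300)
geiger::fitContinuous(mammal.tree,x2) ## should give a very similar fit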

Obviously, if we could only fit models equivalent to those of fitContinuous, this exercise would be a mere curiosity. (I'd still be interested, but it might be harder to persuade empiricists to share that enthusiasm.)

The whole point of this post, however, is that this property of our discretized Markov model permits us to fit models for which the likelihood function is unknown, cannot be analytically derived, or is simply impractical to compute. Bounded Brownian motion is just one example of such a model.

Conveniently, our simulator from before (phytools::fastBM) can also simulate the bounded case. Let's test it out.

y<-fastBM(mammal.tree,sig2=0.4,a=1,bounds=c(0,5))
head(y)
##  U._maritimus     U._arctos U._americanus     N._narica      P._lotor 
##     3.6066062     3.6149849     0.1062355     2.4257928     4.9685050 
##   M._mephitis 
##     3.6812999
range(y)
## [1] 0.1062355 4.9685050

Boucher & Démery assert that the ML solution for the bounds of a bounded Brownian process will always be the observed trait range. This makes sense: widening the bounds beyond the observed data only spreads probability density onto trait values that we never actually see.
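(As a quick sketch of what that looks like in practice – using only arguments already shown in this post – we could simply hand bounded_bm the observed trait range as its bounds, which corresponds to Boucher & Démery's ML solution.)

bounded_bm(mammal.tree,y,lims=range(y),lik.func="parallel",
  root="mle",levs=200)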

In our simulation, however, we happen to know the generating range (we set it), so we can fix the bounds where we know them to be & we should expect estimation to be slightly improved. (I suspect that not knowing the true bounds is rather more common in nature than knowing them, but I could be wrong.)

When I fit my model, I'm going to update one other aspect – I'll set root="nuisance" instead of root="mle". (I'll explain this choice in another post.)

mammal_bounded<-bounded_bm(mammal.tree,y,lims=c(0,5),
  lik.func="parallel",root="nuisance",levs=200)
mammal_bounded
## Object of class "bounded_bm" based on
##  	a discretization with k = 200 levels.
## 
## Unwrapped (i.e., bounded) model
## 
## Set or estimated bounds: [ 0 , 5 ]
## 
## Fitted model parameters:
## 	 sigsq: 0.405921 
##      x0: 2.470428 
## 
## Log-likelihood: -75.125155 
## 
## R thinks it has found the ML solution.

Now let’s compare to unbounded Brownian motion fit using geiger::fitContinuous.

geiger::fitContinuous(mammal.tree,y)
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
## 	sigsq = 0.175411
## 	z0 = 2.430212
## 
##  model summary:
## 	log-likelihood = -94.936936
## 	AIC = 193.873871
## 	AICc = 194.134741
## 	free parameters = 2
## 
## Convergence diagnostics:
## 	optimization iterations = 100
## 	failed iterations = 0
## 	number of iterations with same best fit = 100
## 	frequency of best fit = 1.000
## 
##  object summary:
## 	'lik' -- likelihood function
## 	'bnd' -- bounds for likelihood search
## 	'res' -- optimization iteration summary
## 	'opt' -- maximum likelihood parameter estimates

We should typically find that our bounded model fits better, and provides a much better estimate of the generating value of \(\sigma^2\), than does its unbounded counterpart!
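(To check that in this example, we could compare the two fits directly – just a sketch; mammal_bm is a name I'm introducing here for the unbounded geiger fit.)

mammal_bm<-geiger::fitContinuous(mammal.tree,y)
AIC(mammal_bounded,mammal_bm)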

Finally, let’s try applying this to a real empirical dataset, rather than an empirical tree combined with simulated data (as we’ve been doing so far).

For this example, I'll use a phylogeny and dataset for maximum total length in eels from Collar et al. (2014) – and only because the length of long, skinny things, like snakes & eels, has always struck me as something that might evolve by a bounded Brownian process. (Don't @ me, eel biologists.)

data(eel.tree)
data(eel.data)
max_TL<-setNames(eel.data$Max_TL_cm,rownames(eel.data))
eel_bounded<-bounded_bm(eel.tree,max_TL,lims=range(max_TL),
  lik.func="parallel",root="mle")
eel_bounded
## Object of class "bounded_bm" based on
##  	a discretization with k = 200 levels.
## 
## Unwrapped (i.e., bounded) model
## 
## Set or estimated bounds: [ 15 , 250 ]
## 
## Fitted model parameters:
## 	 sigsq: 188.95333 
##      x0: 15.5875 
## 
## Log-likelihood: -322.74907 
## 
## R thinks it has found the ML solution.
eel_bm<-geiger::fitContinuous(eel.tree,max_TL,model="BM")
eel_bm
## GEIGER-fitted comparative model of continuous data
##  fitted 'BM' model parameters:
## 	sigsq = 113.836440
## 	z0 = 85.702289
## 
##  model summary:
## 	log-likelihood = -338.935210
## 	AIC = 681.870419
## 	AICc = 682.077316
## 	free parameters = 2
## 
## Convergence diagnostics:
## 	optimization iterations = 100
## 	failed iterations = 0
## 	number of iterations with same best fit = 100
## 	frequency of best fit = 1.000
## 
##  object summary:
## 	'lik' -- likelihood function
## 	'bnd' -- bounds for likelihood search
## 	'res' -- optimization iteration summary
## 	'opt' -- maximum likelihood parameter estimates

Lastly, let's compare both of these fitted models to an alternative constraint model: the Ornstein-Uhlenbeck (OU) model.

eel_ou<-geiger::fitContinuous(eel.tree,max_TL,model="OU")
eel_ou
## GEIGER-fitted comparative model of continuous data
##  fitted 'OU' model parameters:
## 	alpha = 0.043393
## 	sigsq = 278.039659
## 	z0 = 92.068484
## 
##  model summary:
## 	log-likelihood = -329.253814
## 	AIC = 664.507628
## 	AICc = 664.928681
## 	free parameters = 3
## 
## Convergence diagnostics:
## 	optimization iterations = 100
## 	failed iterations = 0
## 	number of iterations with same best fit = 2
## 	frequency of best fit = 0.020
## 
##  object summary:
## 	'lik' -- likelihood function
## 	'bnd' -- bounds for likelihood search
## 	'res' -- optimization iteration summary
## 	'opt' -- maximum likelihood parameter estimates
AIC(eel_bounded,eel_bm,eel_ou)
##             df      AIC
## eel_bounded  4 653.4981
## eel_bm       2 681.8704
## eel_ou       3 664.5076

This tells us that – even accounting for the increased parameter complexity of the model – bounded Brownian motion is a substantially better explanation of our eel maximum body length data than the other two models we’ve considered. Neat!
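(If you prefer model weights to raw AIC scores, here's a quick sketch converting the values above to Akaike weights using phytools' aic.w utility.)

aic<-AIC(eel_bounded,eel_bm,eel_ou)
aic.w(setNames(aic$AIC,rownames(aic)))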

That’s all I really have to say about this, but here’s a cool plot illustrating bounded Brownian evolution for when I post about this on Twitter!

data("salamanders")
h<-max(nodeHeights(salamanders))
mt<-make.era.map(salamanders,seq(0,h,length.out=100))
st<-map.to.singleton(mt)
x<-fastBM(st,sig2=0.8,internal=TRUE,bounds=c(-20,20))
par(mar=c(5.1,4.1,1.1,1.1))
phenogram(st,x,ftype="i",fsize=0.7,
  mar=c(5.1,4.1,1.1,1.1),lwd=4,las=1,cex.axis=0.8)
## horizontal lines marking the reflective bounds at -20 & 20
segments(x0=c(0,0),y0=c(-20,20),x1=c(h,h),y1=c(-20,20),
  lwd=2,col=palette()[2])
clip(par()$usr[1],h,min(x),max(x))
grid()
phenogram(st,x,ftype="off",add=TRUE,lwd=2,
  color="white")

[Figure: phenogram ("traitgram") of Brownian evolution simulated on the salamander tree with reflective bounds at -20 & 20.]

Thanks for reading, y’all!
