Today, an R phylogenetics user asked the question:

*“I am interested in determining how many times a trait was gained across
a phylogeny. So I simulated the evolution of the trait (using simmap in
phytools) and I can look at the mean number of gains across in the
simulations, which in my case was ~20. However, I would also like to be
able to point at the nodes where the trait was likely gained on my tree.
For this, I simply rounded the posterior probability at each node to either
0 or 1. However, this only gives me 18 nodes where the trait was likely
gained. I think I can see why there is a difference between those numbers;
although there are 20 gains on average, since those gains aren't always
going to occur on the same nodes, then there aren't going to be 20 nodes
with a probability of gaining the trait >0.5. My question is (1) is my
logic right as to why that discrepancy exists, and (2) if I want to say
'this trait likely evolved X number of times', which is the more correct
way to get that number? Should I report the mean/median (maybe with
confidence intervals), or does it make more sense to look at the posterior
probability at each node and get a number that way?”*

OK.

So in stochastic character mapping, what we're doing is *sampling* plausible
character histories that are consistent with our tip data, under an
evolutionary model.

In *phytools* the default way to do this is to estimate a value for the
transition matrix, **Q**, using likelihood, and the sampling trait histories
with stochastic simulations given this value of **Q** that are consistent
with our observed data. [This isn't the only way to do this in *phytools*.
We could have also using MCMC to first sample **Q** from its posterior
probability distribution, and then used these values to generate our
character histories.]

This results in a *distribution* of trait changes on the tree, rather than
a specific *number* of changes, as we might obtain using a non-statistical
method such as parsimony. For a given number of changes, these different histories
can also imply different sets of changes (in either direction) on the
phylogeny.

Here I'm going to use a real example to give a sense of what I mean. These data are from Benun Sutton & Wilson (2019), and I also use them in my upcoming book with Luke Harmon.

```
library(phytools)
## load data
bonyfish.tree<-read.tree(file="http://www.phytools.org/Rbook/7/bonyfish.tre")
bonyfish.tree
```

```
##
## Phylogenetic tree with 90 tips and 89 internal nodes.
##
## Tip labels:
## Xenomystus_nigri, Chirocentrus_dorab, Talismania_bifurcata, Alepocephalus_tenebrosus, Misgurnus_bipartitus, Opsariichthys_bidens, ...
##
## Rooted; includes branch lengths.
```

```
bonyfish.data<-read.csv(file="http://www.phytools.org/Rbook/7/bonyfish.csv",
row.names=1,stringsAsFactors=TRUE)
head(bonyfish.data)
```

```
## spawning_mode paternal_care
## Xenomystus_nigri pair male
## Chirocentrus_dorab group none
## Talismania_bifurcata group none
## Alepocephalus_tenebrosus group none
## Misgurnus_bipartitus pair none
## Opsariichthys_bidens pair none
```

```
plotTree.datamatrix(bonyfish.tree,X=bonyfish.data,yexp=1.1,fsize=0.5)
```

We're going to analyze just one trait, *spawning mode*, so I'll pull that out
to proceed.

```
spawning_mode<-setNames(bonyfish.data$spawning_mode,
rownames(bonyfish.data))
```

Now let's generate our stochastic character maps as follows.

```
smaps.spawning_mode<-make.simmap(bonyfish.tree,spawning_mode,
model="ARD",nsim=1000,pi="fitzjohn")
```

```
## make.simmap is sampling character histories conditioned on
## the transition matrix
##
## Q =
## group pair
## group -0.0003478837 0.0003478837
## pair 0.0018750505 -0.0018750505
## (estimated using likelihood);
## and (mean) root node prior probabilities
## pi =
## group pair
## 0.002330267 0.997669733
```

```
## Done.
```

```
smaps.spawning_mode
```

```
## 1000 phylogenetic trees with mapped discrete characters
```

The first thing we can do is compute a posterior probability distribution
on the number of changes of each type. This gives us a measure of both the
expected number of changes, and the *dispersion* or variance of changes under
our assumed model.

Doing this in *phytools* is easy. We just use the `density`

method for our
object class as follows.

```
dd<-density(smaps.spawning_mode)
dd
```

```
##
## Distribution of changes from stochastic mapping:
## group->pair pair->group
## Min. :0 Min. :9
## Median :0 Median :10
## Mean :0.28 Mean :10.3
## Max. :4 Max. :12
##
## 95% HPD interval(group->pair): [0, 1]
## 95% HPD interval(pair->group): [10, 11]
```

```
plot(dd,colors=c("black","orange"),alpha=0.5)
```

This shows us that most of our stochastic character map histories show no, or very few, character changes from group to pair spawning, but between 9 and 12 changes from pair to group spawning.

To get the number of changes on each & every one of our 1,000 stochastic mapped
trees, I'm going to run our `summary`

method again, but this time separately on
each `"simmap"`

object in my posterior sample, instead of on the `"multiSimmap"`

object as a whole.

```
ss<-lapply(smaps.spawning_mode,summary)
head(ss,2)
```

```
## [[1]]
## 1 tree with a mapped discrete character with states:
## group, pair
##
## tree has 10 changes between states
##
## changes are of the following types:
## group pair
## group 0 0
## pair 10 0
##
## mean total time spent in each state is:
## group pair total
## raw 679.2357201 5681.0307286 6360.266
## prop 0.1067936 0.8932064 1.000
##
##
## [[2]]
## 1 tree with a mapped discrete character with states:
## group, pair
##
## tree has 10 changes between states
##
## changes are of the following types:
## group pair
## group 0 0
## pair 10 0
##
## mean total time spent in each state is:
## group pair total
## raw 818.6079802 5541.6584685 6360.266
## prop 0.1287066 0.8712934 1.000
```

In this case, I see that the first two of my stochastic maps both have 10 changes, and that all of these changes are from pair to group spawning.

It's easy enough to get this tally for all 1,000 stochastically mapped trees.

```
n.changes<-t(sapply(ss,function(x) c(x$Tr[2,1],x$Tr[1,2])))
dimnames(n.changes)<-list(1:length(smaps.spawning_mode),
c("pair->group","group->pair"))
head(n.changes,30)
```

```
## pair->group group->pair
## 1 10 0
## 2 10 0
## 3 10 0
## 4 10 0
## 5 10 0
## 6 10 0
## 7 10 0
## 8 10 0
## 9 10 0
## 10 10 0
## 11 11 1
## 12 11 1
## 13 10 0
## 14 10 0
## 15 11 2
## 16 10 0
## 17 10 0
## 18 10 0
## 19 10 0
## 20 10 0
## 21 10 0
## 22 10 0
## 23 11 0
## 24 10 1
## 25 10 1
## 26 12 0
## 27 10 0
## 28 10 0
## 29 10 0
## 30 10 0
```

Just inspecting the first 30 trees, we see that the most common history involves
10 changes from pair to group spawning, and zero changes in the reverse direction;
however, our stochastic histories can *sometimes* involve as many as 12 changes,
as well as a small number of changes in the *opposite* direction.

Let's graph some examples of these.

First, a stochastic character history with 10 “forward” (pair to group) changes and zero changes in the opposite direction.

```
## set colors
cols<-setNames(c("black","orange"),c("pair","group"))
## set map
i<-7
plot(smaps.spawning_mode[[i]],cols,ftype="i",fsize=0.6,
outline=TRUE,lwd=3,ylim=c(-5,Ntip(bonyfish.tree)))
markChanges(smaps.spawning_mode[[i]],
setNames(c("black","black"),names(cols)),cex=0.5,lwd=5)
markChanges(smaps.spawning_mode[[i]],cols,cex=0.5,lwd=3)
text(paste("Number of changes:",sum(n.changes[i,])),
x=mean(par()$usr[1:2]),
y=-4,cex=1.5)
```

Now, an example with 11 forward changes, and 2 reversals.

```
i<-749
plot(smaps.spawning_mode[[i]],cols,ftype="i",fsize=0.6,
outline=TRUE,lwd=3,ylim=c(-5,Ntip(bonyfish.tree)))
markChanges(smaps.spawning_mode[[i]],
setNames(c("black","black"),names(cols)),cex=0.5,lwd=5)
markChanges(smaps.spawning_mode[[i]],cols,cex=0.5,lwd=3)
text(paste("Number of changes:",sum(n.changes[i,])),
x=mean(par()$usr[1:2]),
y=-4,cex=1.5)
```

Lastly, a rare example with only 9 forward changes (and one reversal).

```
i<-637
plot(smaps.spawning_mode[[i]],cols,ftype="i",fsize=0.6,
outline=TRUE,lwd=3,ylim=c(-5,Ntip(bonyfish.tree)))
markChanges(smaps.spawning_mode[[i]],
setNames(c("black","black"),names(cols)),cex=0.5,lwd=5)
markChanges(smaps.spawning_mode[[i]],cols,cex=0.5,lwd=3)
text(paste("Number of changes:",sum(n.changes[i,])),
x=mean(par()$usr[1:2]),
y=-4,cex=1.5)
```

*All* of these character histories are consistent with our tip data and
have been sampled in proportion to their probability under a model of
character evolution that we specified.

One interesting, but totally logical(!!), aspect of this analysis is that
even though the lattermost character history involves exactly the same
number of character changes as the first one we inspected, it is actually
must less probable, and thus sampled much less frequently by our method.
Why? Because the only way it can result is if there is a pair to group
spawning transition along the very short edge of the tree leading to the
common ancestor of the *Polyprion* through *Echensis* clade. Neat!

## No comments:

## Post a Comment

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.