## Monday, August 15, 2016

### Variance & confidence intervals on reconstructed tip states with `anc.ML`

A phytools user recently contacted me with the following query:

“In short, is it possible to find confidence intervals for ancestral states reconstructed with `anc.ML`? We know `fastAnc` gives us the variance and confidence intervals, but we have been using `anc.ML` for our continuous variables because we usually have data missing from our tip states. We'd like to be able to use confidence intervals for the reconstructed nodes (`model\$ace`) and tips (`model\$missing.x`) in our analysis.”

Theoretically, it should not be too hard to obtain variances (& thus confidence intervals) by first computing the Hessian matrix of 2nd-order partial derivatives of the likelihood surface at the MLE. The negative inverse of this matrix has the asymptotic property of being an estimator of the covariance matrix for our MLEs.

I have now added this to phytools.

Here is a quick demo with missing data:

``````library(phytools)
packageVersion("phytools")
``````
``````## [1] '0.5.47'
``````
``````tree
``````
``````##
## Phylogenetic tree with 26 tips and 25 internal nodes.
##
## Tip labels:
##  A, B, C, D, E, F, ...
##
## Rooted; includes branch lengths.
``````
``````x.sampled
``````
``````##          X          H          P          J          M          T
## -1.2380228  1.0293725 -0.8456503 -1.9916634 -2.8730554  0.3547209
##          G          O          D          I          W          Z
##  1.4444012  0.8178155  2.1005946 -2.0430705  0.6952082 -0.3092633
##          U          V          Y          S          F          K
##  1.6612110  1.0018500 -1.8732164  0.2649342  2.8013645 -0.8186988
##          B          N
##  0.3439659 -2.7633222
``````
``````fit<-anc.ML(tree,x.sampled,CI=TRUE)
fit
``````
``````## Ancestral character estimates using anc.ML under a BM model:
##        27        28        29        30        31        32        33
##  0.311053  0.391244  0.523655  0.999723  1.110276  0.848005  2.006952
##        34        35        36        37        38        39        40
##  1.485751  2.172200  1.266575  0.235682 -0.495487 -1.696335 -1.534325
##        41        42        43        44        45        46        47
## -2.001946 -1.499331 -2.736365 -0.177092  0.732131  0.539801  0.539860
##        48        49        50        51
##  0.330409  0.996167 -0.868859 -1.424747
##
## Lower & upper 95% CIs:
##        lower     upper
## 27 -0.669159  1.291264
## 28 -0.515972  1.298459
## 29 -0.272219  1.319530
## 30  0.216615  1.782832
## 31  0.313083  1.907468
## 32  0.005303  1.690707
## 33  1.593210  2.420694
## 34  0.700476  2.271026
## 35  1.377463  2.966936
## 36  0.826782  1.706368
## 37 -0.542467  1.013830
## 38 -1.216217  0.225242
## 39 -2.212909 -1.179761
## 40 -1.949853 -1.118797
## 41 -2.117523 -1.886370
## 42 -1.919639 -1.079023
## 43 -2.959644 -2.513085
## 44 -0.789921  0.435737
## 45  0.064848  1.399414
## 46 -0.147178  1.226780
## 47 -0.509807  1.589527
## 48  0.076263  0.584555
## 49  0.390736  1.601598
## 50 -1.447851 -0.289866
## 51 -1.762894 -1.086600
##
## Fitted model parameters & likelihood:
##     sig2 log-likelihood
##  0.43949      -18.88231
##
## R thinks it has found the ML solution.
``````

The print method does not indicate whether some tips were missing, and if those tip values have been retained; however these can be obtained from the fitted fitect:

``````str(fit)
``````
``````## List of 10
##  \$ sig2        : Named num 0.439
##   ..- attr(*, "names")= chr ""
##  \$ ace         : Named num [1:25] 0.311 0.391 0.524 1 1.11 ...
##   ..- attr(*, "names")= chr [1:25] "27" "28" "29" "30" ...
##  \$ logLik      : num -18.9
##  \$ counts      : Named int [1:2] 98 98
##   ..- attr(*, "names")= chr [1:2] "function" "gradient"
##  \$ convergence : int 0
##  \$ message     : chr "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
##  \$ model       : chr "BM"
##  \$ CI95        : num [1:25, 1:2] -0.669 -0.516 -0.272 0.217 0.313 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..\$ : chr [1:25] "27" "28" "29" "30" ...
##   .. ..\$ : NULL
##  \$ missing.x   : Named num [1:6] 0.848 2.007 2.172 -1.499 0.54 ...
##   ..- attr(*, "names")= chr [1:6] "A" "C" "E" "L" ...
##  \$ missing.CI95: num [1:6, 1:2] -0.56 1.412 0.918 -2.271 -0.594 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..\$ : chr [1:6] "A" "C" "E" "L" ...
##   .. ..\$ : NULL
##  - attr(*, "class")= chr "anc.ML"
``````
``````fit\$missing.x
``````
``````##          A          C          E          L          Q          R
##  0.8478992  2.0069709  2.1721374 -1.4993467  0.5398655  0.5398655
``````
``````fit\$missing.CI95
``````
``````##         [,1]       [,2]
## A -0.5598708  2.2556692
## C  1.4120210  2.6019208
## E  0.9184274  3.4258474
## L -2.2708686 -0.7278247
## Q -0.5937081  1.6734391
## R -0.5937081  1.6734391
``````

That's really all there is to it.

Here are these values overlain on a `contMap` style plot:

``````obj<-contMap(tree,x=c(x.sampled,fit\$missing.x),method="user",
anc.states=fit\$ace,plot=FALSE)
plot(obj)
arrl=0.1,hedl=0.05,col="blue")
``````

You might be able to see that the missing tips (marked in blue) have reconstructed states at the tips that are equivalent to the internal parent node value preceding them. This is exactly what we hope for under BM where the expected change is zero along any edge.

Note that the reason analytic solutions are used by `fastAnc` instead of the Hessian matrix is because (in general) for small trees these CIs will be too small. Theoretically, this problem should decrease for larger datasets.

The tree and character data for this example were simulated as follows:

``````library(phytools)
## simulate tree & data
x<-fastBM(tree<-pbtree(n=26,tip.label=LETTERS))
## simulate missing data
x.sampled<-sample(x,20)
x.sampled
``````

Finally, the newest development version of phytools can always be installed from GitHub using devtools as follows:

``````library(devtools)
install_github("liamrevell/phytools")
``````