## Thursday, January 31, 2013

### Note on polytomies and internal branches of zero length in ancestral character estimation

Just a quick note on the use of polytomous trees vs. arbitrarily fully-resolved trees with branches of zero length in phytools functions for ancestral character estimation such as anc.Bayes, anc.ML, anc.trend, and ancThresh (as well as other actions that call these functions internally). Basically, and confusingly unlike functions such as pic in ape, polytomous trees (not internal branches of zero length) should be used. This is because of one specific internal calculation in these functions involving the inversion of the among tip species and internal node variance-covariance matrix computed by the phytools function vcvPhylo. If polytomies are represented using internal branches of zero length then vcvPhylo(tree) returns a singular matrix, which cannot be inverted. By contrast, if polytomies are represented as polytomies, vcvPhylo(tree) is not singular.

Here's an example:
> # create a tree with a polytomy
> plotTree(tree,node.numbers=TRUE)
> # compute among species & node VCV matrix
> C<-vcvPhylo(tree)
> C
A B C D 6
A 2 0 0 0 0
B 0 2 1 1 1
C 0 1 2 1 1
D 0 1 1 2 1
6 0 1 1 1 1
> # invert?
> invC<-solve(C)
> invC # no problem
A  B             C             D  6
A 0.5  0  0.000000e+00  0.000000e+00  0
B 0.0  1  0.000000e+00 -7.401487e-17 -1
C 0.0  0  1.000000e+00 -7.401487e-17 -1
D 0.0  0  4.163336e-17  1.000000e+00 -1
6 0.0 -1 -1.000000e+00 -1.000000e+00  4
> # resolve all nodes
> tree<-multi2di(tree)
> plotTree(tree,node.numbers=TRUE)
> # compute among species & node VCV matrix
> C<-vcvPhylo(tree)
> C
A B C D 6 7
A 2 0 0 0 0 0
B 0 2 1 1 1 1
C 0 1 2 1 1 1
D 0 1 1 2 1 1
6 0 1 1 1 1 1
7 0 1 1 1 1 1
> # invert?
> invC<-solve(C)
Error in solve.default(C) :
Lapack routine dgesv: system is exactly singular: U[6,6]=0

It would seem to make sense to just include an internal step in all of these function in which we use di2multi to collapse any branches of zero length - the problem with this idea is that the consequence would be a disassociation between the node numbers of the original tree and the node numbers for which ancestral character estimates are returned. Yikes! In future versions of these functions I will try to remember to have them spit a meaningful error if a tree with branches of zero length is input - rather than just collapsing as at the present time.

### New version of pgls.Ives

I just posted a new version of the phytools function pgls.Ives, which is a function to fit the regression model of Ives et al. (2007) incorporating sampling error in the estimation of species means (e.g., here).

The main update in this version is that it can now accept individual data, rather than just species means, variances, and covariances. The form for the individual data is the same as fitBayes (e.g., here) - in other words, just long vectors with names equal to the species names. Obviously, some names will repeat - which is allowed in vectors.

There is nothing magic about the way this function works. It just takes the individual data, and then computes the within-species variances and covariances (which are themselves used to estimate within-species sampling variances and covariances). This is done using, for example, R base functions such as aggregate. For those in need of a quick primer, the following demo shows how aggregate can be used to compute within-species variances and covariances for a vector of values x containing individual values with names(x) being the species names, some of which repeat:
> # simulate tree & data
> tree<-pbtree(n=30)
> x<-fastBM(tree) # true means
> # individual values
> xe<-sampleFrom(xhat,randn=c(1,5))
> xe
t13         t13         t13         t13         t25
-1.83130612 -2.08011272 -1.99096486 -3.26950066 -1.17626558
t25         t26         t26         t20         t20
0.57125210 -1.90520230 -1.96856340         ...         ...

> # get species means
> xbar<-aggregate(xe,by=list(names(xe)),FUN=mean)
> xbar<-setNames(xbar[,2],xbar[,1])
> xbar
t1         t10         t11         t12         t13
-1.71666562 -3.92387486 -3.78403128 -1.82452852 -2.29297109
t14         t15         t16         t17         t18
-0.83735741 -0.66027704 -1.84528088        ...         ...

> # get species variances
> xvar<-aggregate(xe,by=list(names(xe)),FUN=var)
> xvar<-setNames(xvar[,2],xvar[,1])
> xvar
t1         t10         t11         t12         t13
0.319766112          NA 1.330431973 0.420867485 0.434420333
t14         t15         t16         t17         t18
0.910668871 0.985273462 0.560029491        ...         ...

Something that's apparent from the little simulation above, is that if there is not at least two samples for a particular species, then the species variance cannot be computed. In that case, the program merely assumes that those species have within-species variance equal to the mean within-species variance - but it will also spit a warning to that effect.

Computing the sampling covariance from individual data requires that the same individuals be sampled in both vectors. Thus, if the argument Cxy is not supplied, pgls.Ives will try to do it. If different individuals were used for x & y, then sampling covariance is probably negligible. In this case - if individual data is being supplied - then one should also give the function a named vector of zeroes for the argument Cxy to avoid any errors of spurious results.

Finally, here's a quick demo showing how to use individual data with pgls.Ives for both zero and non-zero error covariance:
> source("pgls.Ives.R")
>
> # simulate tree & data under the regression model
> tree<-pbtree(n=30)
> xhat<-fastBM(tree)
> yhat<-0.75*xhat+0.2*fastBM(tree)
>
> # simulate individual data for x & y
> # assuming Cxy!=0
> x<-sampleFrom(xhat,randn=c(1,10))
> n<-summary(as.factor(names(x)))
> y<-sampleFrom(yhat,n=n[names(yhat)])
>
> # fit model
> pgls.Ives(tree,x,y)
\$beta
 -1.193712  1.124268
\$sig2x
 0.5043893
\$sig2y
 0.219358
\$a
 3.111079 2.303972
\$logL
 -74.45021
\$convergence
 0
\$message
 "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"

Warning messages:
1: Some species contain only one sample. Substituting mean variance.
2: Some species contain only one sample. Substituting mean variance.
3: Some species contain only one sample. Substituting mean variance.

Download the latest version of this function on the phytools page - or wait a bit and it will be in the next CRAN release.

## Saturday, January 5, 2013

### Adding new tips at random to a phylogeny

A friend and colleague contacted me recently with the request (somewhat abbreviated), below:

What I'm interested in doing is adding a given number of tips at random to a tree. I've done a comparison where I take a phylogeny. . . [and randomly prune taxa]. I'm interested to also do the reverse. . . . Would there be an easy way in phytools to do a "random addition of a given number of tips"?

Well, it is possible to randomly add tips to the tree - just by (for instance) picking nodes at random (from, say, the set of all internal & terminal nodes - excluding the root node), picking a position at random along the corresponding parent edge, and then using the phytools function bind.tip to attach the new tip to the tree.

So, for instance, we could just do:
> require(phytools)
> # simulate a random pure-birth tree
> tree<-pbtree(n=12)
> # pick a node at random
> node<-sample(c(1:length(tree\$tip), 2:tree\$Nnode+length(tree\$tip)),size=1)
> node
 22
> # pick a random position along the branch ending in node
> position<-runif(n=1)* tree\$edge.length[which(tree\$edge[,2]==node)]
> # this just tells us the relative position
> # (from the root)
> # check where the tip will be added
> (tree\$edge.length[which(tree\$edge[,2]==node)]- position)/tree\$edge.length[which(tree\$edge[,2]==node)]
 0.700115
> # now attach new tip
> new.tree<-bind.tip(tree,"t13",where=node, position=position)
> # plot the trees
> par(mfrow=c(2,1))
> plotTree(tree,node.numbers=T)
> plotTree(new.tree,node.numbers=T)
We could do this sequentially a number of times - each time updating the tree as well as the set of nodes in the tree from which we are picking.

One shortcoming of this approach is that it ignores relative branch lengths - added tips are no more likely to be located on long branches than on short ones. Ideally, I think we'd like the probability of adding a tip along a branch to vary in direct proportion to the relative length of each branch. We can do that we taking advantage of the base function cumsum which (obviously) computes a cumulative sum from a vector. I have put this into a new function, add.random, which can be downloaded here. Let's test out the function to see if it behaves as we'd expect.
> # simulate a random, non-ultrametric tree
> tree1<-rtree(20)
> # set one branch to be very long (doesn't matter which)
> tree1\$edge.length[sample(1:nrow(tree1\$edge),size=1)]<-100
> # add new tips at random
> # here, we supply edge lengths - but we don't need to
> # plot both trees
> layout(c(1,2),heights=c(2,3))
> plotTree(tree1,fsize=0.8)
> plotTree(tree2,fsize=0.8)
We can see that nearly all the new tips have been added on the very long branch that we created, which is exactly what we'd hoped.

As in the phytools function bind.tip, when the tree is ultrametric if new tips are added without branch lenghs, branch lengths are set such the tree remains ultrametric with the new tips. For example:
> # simulate pure-birth tree
> tree1<-pbtree(n=20)
> # add new tips at random, without supplying branch
> # lengths
> # plot both trees
> layout(c(1,2),heights=c(2,3))
> plotTree(tree1,fsize=0.8)
> plotTree(tree2,fsize=0.8)
One thing that we can observe about the plots above is that because tips are added sequentially (not somehow all at once to the original tree), the resulting phylogeny can include entire clades (for instance, the group of ((t29,t25),t27)) not found in the original phylogeny.

Finally, we can also supply our own names, so, for instance:
> tree<-pbtree(n=12)
> plotTree(tree)
That's it!

## Tuesday, January 1, 2013

### Processing .tps morphometric files in R

This has absolutely nothing to do with phylogenetics, but today (New Year's day 2013!) I've been starting the year with some R programming to pull data out of the raw output files from a program tpsDIG2 (by Jim Rohlf) that we use to get data from landmarks on digital skeletal x-rays of lizards.

We are just getting linear measurements at the moment (no fancy geometric morphometrics), so our data analysis is pretty simple. Here's what a skeletal x-ray of a lizard with the landmark coordinates overlain looks like (click for larger version):
(There is also a scale bar, not shown.)

The output from tpsDIG looks as follows (some lines omitted):
LM=43
726.00000 1580.00000
773.00000 1457.00000
679.00000 1456.00000
840.00000 1493.00000
840.00000 1463.00000
...
SCALE=0.125544

The first line gives the number of landmarks; subsequent lines give the horizontal & vertical pixel positions for each landmark; and the final lines give the image information & scale. Obviously, in R we want to read in the file, translate it to meaningful (i.e., non-pixel) scale, and perhaps compute our desired linear measurements. We might also want to be able to visually inspect our data for errors.

I have written two functions tps.process (to pull data out of the file and translate the scale); and tps.postProcess, which computes the linear measures we want from our input files. The first function is thus "general" (suitable for many people getting linear measures from .tps files); whereas the latter is specific to our data, although it could probably be modified and re-purposed easily. The code is posted here.

Here's a quick example. Note that I have also included a mode tps.process(...,type="anole") that visualizes the landmark data (for error checking) in a sensible way specifically for our x-rays.

> list.files()
 "KMW_004.tps" "KMW_122.tps" "KMW_317.tps"
 "tps.process.R"
> files<-list.files(pattern=".tps")
> files
 "KMW_004.tps" "KMW_122.tps" "KMW_317.tps"
> source("tps.process.R")
> X<-tps.process(files,type="anole")
OK? Press <Enter> to continue...
OK? Press <Enter> to continue...
OK? Press <Enter> to continue...
One of the plots above is created for each input .tps file so that we can easily review our input data for errors. (You can see it even looks kind of like a lizard!)

> # Here's our data rescaled
> X
V1        V2
1  28.338576 75.445244
2  33.807424 59.287284
3  21.751100 59.784452
4  49.343924 62.643168
5  47.355252 59.287284
6   6.214600 60.157328
...
V1        V2
1  29.126208 80.724792
2  35.026776 65.282880
...
> # now let's automatically post-process
> LL<-tps.postProcess(X)
> LL
rJL      lJL       JW    rMETC    lMETC
KMW_004adj  17.05837 16.98986 12.06657 3.900868 3.890955
KMW_122_adj 16.53086 16.64819 11.80180 3.766320 3.694494
KMW_317adj  15.80653 16.18932 11.41486 3.283738 3.391124
KMW_004adj  8.728803 8.672875  9.743256  9.175747 11.57585
KMW_122_adj 9.094794 9.109513 10.081895 10.313728 12.43139
KMW_317adj  8.239660 8.533229  9.663571  9.765426 12.13156
lHUM     PECT     PELV     rFEM     lFEM
KMW_004adj  12.01011 8.082803 6.840578 16.44463 16.42960
KMW_122_adj 12.19394 7.910268 6.872889 17.09659 16.83975
KMW_317adj  11.85941 7.864227 6.990424 15.13729 15.12956
rTIB     lTIB     rFIB     lFIB   rMETT1
KMW_004adj  13.59564 13.38724 13.41548 13.33405 5.133726
KMW_122_adj 14.58960 14.54415 14.47245 14.17035 5.280316
KMW_317adj  12.61329 12.73256 12.48540 12.61020 4.745144
rMETT2   lMETT1   lMETT2