Wednesday, February 26, 2014

Rphylip mostly done....

I've been continuing to plug away at Rphylip, our new R interface for Joe Felsenstein's phylogeny methods package PHYLIP. More information about this effort can be found here: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The latest additions are Rkitsch, an interface for KITSCH (a least-squares & ME inference program with a clock constraint); Rrestdist, an interface for RESTDIST (a program for distance calculation from restriction site or fragment data); Rrestml, an interface for RESTML (a restriction site ML tree inference program); and Rclique, an interface for CLIQUE (phylogeny inference from binary characters by the compatibility method). I also added a new object class, "rest.data", for restriction data, as well as some methods to convert to the class & print.

Rphylip is on GitHub, and the latest build can be downloaded & installed from source.

Here's a quick demo of Rrestml using the demo dataset from the RESTML documentation in PHYLIP:

> require(Rphylip)
Loading required package: Rphylip
Loading required package: ape
> packageVersion("Rphylip")
[1] ‘0.1.20’
> data(restriction.data)
> restriction.data
13 restriction site scores for 5 species stored in a object of class "rest.data".

All sequences of same length: 13

Number of restriction enzymes used to generate the data: 2

Labels: Alpha Beta Gamma Delta Epsilon

> mltree<-Rrestml(restriction.data,quiet=TRUE)

Restriction site Maximum Likelihood method, version 3.695


  Recognition sequences all 6 bases long

Sites absent from all species are assumed to have been omitted


     +2        
  +--1 
  |  |  +4        
  |  +--2 
  |     +5        
  | 
  3----3        
  | 
  +1        


remember: this is an unrooted tree!

Ln Likelihood =   -40.31850


Between  And  Length   Approx. Confidence Limits
-------  ---  ------   ------- ---------- ------
   3        1  0.01396  (     zero,     0.04907)
   1     2     0.00064  (     zero,    infinity)
   1        2  0.05872  (     zero,     0.12666) **
   2     4     0.01447  (     zero,     0.04458) **
   2     5     0.00100  (     zero,    infinity)
   3     3     0.10801  (  0.01151,     0.21877) **
   3     1     0.01046  (     zero,     0.04404)

     *  = significantly positive, P < 0.05
     ** = significantly positive, P < 0.01

> plot(mltree,no.margin=TRUE,type="unrooted",edge.width=2)

OK. That's all for now. We're still working on documentation & examples, but hopefully we will have a version on CRAN before too long.

New faster version of ltt (& ltt95) for ultrametric trees

Some phytools users reported problems with the phytools function ltt95 (for plotting a 95% high probability LTT from a posterior sample of trees). This is likely due to the use of ltt internally, which is very slow. The reason it is slow is because it does not assume that the tree is ultrametric (& is probably unnecessarily slow even so, but that's a problem for another day). If we first check if the tree is ultrametric we can call branching.times internally which is fast, and everything else is sped along considerably.

I have now done that. The updated code is here & it is also part of a new phytools build (here) which can be downloaded & installed from source. When the trees in the posterior sample are ultrametric, the speed-up that results is really extraordinary (about 10× for a tree with 100 taxa).

Tuesday, February 18, 2014

New Rphylip functions: Rgendist & Rdolpenny

Rphylip, an R interface to the PHYLIP software package by Joe Felsenstein (1989, 2013), reached an entirely meaningless milestone today: it now contains R interfaces for exactly* 2/3 of the programs in the PHYLIP package. (*By some accounting. I have included the program THRESHML, which is not yet part of PHYLIP; and there is also a couple of programs that have been counted for which we will probably not write interfaces.) That means that Rphylip now has 24 different interface functions, as well as a number of other helper functions (including some that add new functionality, such as opt.Rdnaml.) The two latest additions to the family are Rdolpenny (an interface for DOLPENNY) and Rgendist (an interface for GENDIST).

Here's a quick demo of the latter, Rgendist, which is for the calculation of genetic distances from gene frequency data. The data used here (in X) are from the test data in the GENDIST documentation:

> X
           locus 1 locus 2 locus 3 locus 4 locus 5 locus 6
European    0.2868  0.5684  0.4422  0.4286  0.3828  0.7285
African     0.1356  0.4840  0.0602  0.0397  0.5977  0.9675
Chinese     0.1628  0.5958  0.7298  1.0000  0.3811  0.7986
American    0.0144  0.6990  0.3280  0.7421  0.6606  0.8603
Australian  0.1211  0.2274  0.5821  1.0000  0.2018  0.9000
           locus 7 locus 8 locus 9 locus 10
European    0.6386  0.0205  0.8055   0.5043
African     0.9511  0.0600  0.7582   0.6207
Chinese     0.7782  0.0726  0.7482   0.7334
American    0.7924  0.0000  0.8086   0.8636
Australian  0.9837  0.0396  0.9097   0.2976
> Dnei<-Rgendist(X)

....

Genetic Distance Matrix program, version 3.695

....

Distances calculated for species
    1           
    2            .
    3            ..
    4            ...
    5            ....

Distances written to file "outfile"

Done.

> Dnei
           European  African  Chinese American
African    0.078002                          
Chinese    0.080749 0.234698                 
American   0.066805 0.104975 0.053879        
Australian 0.103014 0.227281 0.063275 0.134756

> # Cavalli-Sforza (1967) distances
> Dcavalli<-Rgendist(X,method="Cavalli-Sforza")

....

Genetic Distance Matrix program, version 3.695

....

Distances calculated for species
    1           
    2            .
    3            ..
    4            ...
    5            ....

Distances written to file "outfile"

Done.

> Dcavalli
           European  African  Chinese American
African    0.181749                          
Chinese    0.181987 0.480537                 
American   0.129497 0.231519 0.147522        
Australian 0.260814 0.480491 0.123618 0.283144

An input matrix, as shown above, is only one of two ways that the data can be sent to Rgendist. The user can also supply a list of matrices in which each matrix contains the frequencies of each allele at a single locus (and thus the length of the list is equal to the number of loci in the analysis). See the function documentation for more information.

All the latest work on Rphylip (including package builds) can be obtained from GitHub.

That's all for now.

Sunday, February 16, 2014

New Rphylip function: Rdollop

After a hiatus to work on other stuff (including my winter-term tropical ecology course and various updates to phytools), I have finally returned to working on my R interface for Felsenstein's PHYLIP software package, Rphylip. The latest update is a new function, Rdollop which wraps around PHYLIP program DOLLOP, a program for Dollo & polymorphism parsimony tree inference. For more information about DOLLOP, refer to its documentation page.

Here's a quick demo using a binary primate dataset packaged with Rphylip. Note that this is not a dataset that I suspect evolved under Dollo's law, thus the example is simply demonstrative in nature:

> require(Rphylip)
Loading required package: Rphylip
Loading required package: ape
> data(primates.bin)
> tree<-Rdollop(primates.bin)

Dollo and polymorphism parsimony algorithm, version 3.695

...

Adding species:
   1. 7        
   2. 12       
   3. 6        
   4. 1        
   5. 3        
   6. 4        
   7. 10       
   8. 5        
   9. 2        
  10. 8        
  11. 11       
  12. 9        

Doing global rearrangements
  !-----------------------!
   .......................
   .......................

...

Dollo and polymorphism parsimony algorithm, version 3.695

Dollo parsimony method

     7 trees in all found


  +--------------------------------2  
  ! 
  !                    +-----------9        
  !                    ! 
  !        +-----------8     +-----12       
  !        !           !  +-11 
  !        !           !  !  !  +--11       
  !        !           +--9  +-10 
  !        !              !     +--10       
--1        !              ! 
  !     +--7              +--------8        
  !     !  ! 
  !     !  !                 +-----6        
  !     !  !              +--5 
  !     !  !              !  !  +--7        
  !  +--3  +--------------4  +--6 
  !  !  !                 !     +--5        
  !  !  !                 ! 
  +--2  !                 +--------4        
     !  ! 
     !  +--------------------------3        
     ! 
     +-----------------------------1        


requires a total of    154.000

...     

Translation table
-----------------
        1       Lemur
        2       Tarsier
        3       Sq.Monkey
        4       J.Macaque
        5       R.Macaque
        6       E.Macaque
        7       B.Macaque
        8       Gibbon
        9       Orangutan
        10      Gorilla
        11      Chimp
        12      Human

> tree
7 phylogenetic trees
> plot(tree,no.margin=TRUE)
Waiting to confirm page change...
...

... and obviously 6 more trees as well, in this case.

That's it. A Rphylip build containing this function can be downloaded from GitHub.

Tuesday, February 11, 2014

Small fix in evol.vcv and new S3 print method

A phytools user pointed out that there was a peculiar 'bug' in the phytools function evol.vcv due to an inconsistency between the variable names in the function definition for an internally used function and the names of the variables used inside this function. It is peculiar because (by a stroke of luck) the internally used variable name is already defined correctly within the main function which means that this bug can never cause a problem. I have fixed it anyway in a new function version and phytools build (here).

Since I already had the source code for this function open, I decided to add an S3 print method for objects returned by evol.vcv. This is in the style of what I'd already done for brownie.lite. This seems to work pretty well. Here's a quick demo.

> require(phytools)
Loading required package: phytools
Loading required package: ape
Loading required package: maps
Loading required package: rgl
> packageVersion("phytools")
[1] ‘0.3.91’
> tree<-pbtree(n=26,tip.label=LETTERS[26:1])
> Q<-matrix(c(-1,1,1,-1),2,2)
> rownames(Q)<-colnames(Q)<-c("A","B")
> tree<-sim.history(tree,Q,anc="A")
> plotSimmap(tree,lwd=3)
no colors provided. using the following legend:
      A       B
"black"   "red"
> V<-list(matrix(c(1,0,0,1),2,2),matrix(c(1,1.2,1.2,2),2,2))
> names(V)<-c("A","B")
> V
$A
     [,1] [,2]
[1,]    1    0
[2,]    0    1

$B
     [,1] [,2]
[1,]  1.0  1.2
[2,]  1.2  2.0

> X<-sim.corrs(tree,vcv=V)
> fit<-evol.vcv(tree,X)
> fit
ML single-matrix model:
        R[1,1]  R[1,2]  R[2,2]  k       log(L)
fitted  0.675   0.355   1.5944  5       -68.1635
      
ML multi-matrix model:
        R[1,1]  R[1,2]  R[2,2]  k       log(L)
A       0.7789  -0.3202 0.4378  8       -64.6364
B       0.6622  1.0262  2.7149

P-value (based on X^2): 0.0702

R thinks it has found the ML solution.

This is pretty much what we were going for. It might get a little messy if we have a lot of columns in X. That's it.

Saturday, February 1, 2014

New version of threshDIC for comparing OU & λ models

I just posted a new version of the function threshDIC that allows users to use the deviance information criterion (DIC) to compare Brownian, OU, and λ models for evolution of the liability on the tree in ancThresh. I implemented threshDIC primarily to compare alternative sequences for the threshold characters on the liability axis - but (theoretically) the same approach could be useful in choosing among alternative models for the liability.

DIC is similar to the better known AIC, except that it can be used for Bayesian approaches in which we are unable to maximize the likelihood and all we have is a sample from the posterior distribution obtained by MCMC. DIC estimates the effective parameterization of our model by computing the difference between the mean likelihoods evaluated for each of our samples from the posterior and the likelihood for our mean parameter values. The logic is that this difference will be increase with the effective number of parameters in the model (because when the number of parameters is large we tend to spend most of the time during MCMC far from the parameter values that maximize the likelihood).

This code is also in a new version of phytools here. (In fact, should you try to run it from source without updating phytools it will not work because it has an internal dependency only present in the last couple of non-CRAN versions.)