Monday, December 9, 2013

Rphylip: consense, proml, promlk, and some functions for handling protein sequences

Since my last post on Rphylip I've added R interfaces for a few more of PHYLIP programs: consense (Rconsense), proml (proml), and promlk (promlk). The latter two do ML phylogeny estimation (without & with a molecular clock, respectively) from protein sequences.

In addition to these new interfaces, I also added a few different functions for handling protein sequences in R. I should say that I first attempted to establish whether any such methods already existed in other phylogeny packages by running help.search("protein") on all of my installed packages. What I didn't think to do was search for "amino acid". Had I done so, I would have found the function read.aa in the phangorn package, with which at least my Rphylip function read.protein is basically redundant.

The protein handling functions in Rphylip are as follows:

read.protein, an amino acid sequence reading function that can accept data files in formats "fasta" or "sequential". If format="sequential" or format="fasta" with all sequences of the same length then read.protein will store the input data in a matrix; whereas if format="fasta" and sequences have different lengths, then the data will be stored as a list. In both cases the object is assigned the class attribute "proseq" (protein sequence, in case it wasn't obvious enough).

print.proseq is an S3 print method for objects of class "proseq". The print output is designed to mimic that of print.DNAbin in the ape package. (See demo below.)

Finally, as.proseq converts protein sequences from other formats. At the moment, it only recognizes & converts from objects of class phyDat (with attr(x,"type")="AA") from the phangorn package.

OK - here is a brief illustration of the Rproml proml interface using an amino acid dataset for chloroplasts from phangorn written to file:

> require(Rphylip)
Loading required package: Rphylip
Loading required package: ape
> X<-read.protein("chloroplast.aa")
> X
19 protein sequences in character format stored in a matrix.

All sequences of same length: 5144

Labels: Trico Nostoc Syn6301 Prochl Syn8102 Thermo ...

Amino acid composition:
    A     C     D     E     F     G     H     I     K     L
0.087 0.007 0.040 0.047 0.051 0.091 0.029 0.074 0.040 0.102
    M     N     P     Q     R     S     T     V     W     Y
0.025 0.036 0.048 0.040 0.052 0.054 0.052 0.073 0.022 0.032

> tree<-Rproml(X,speedier=TRUE,global=FALSE)

...

Amino acid sequence Maximum Likelihood method, version 3.695

...

Adding species:
   1. 14       
   2. 10       
   3. 11       
   4. 7        
   5. 16       
   6. 19       
   7. 3        
   8. 2        
   9. 9        
  10. 8        
  11. 13       
  12. 12       
  13. 5        
  14. 1        
  15. 15       
  16. 18       
  17. 4        
  18. 17       
  19. 6

...

Jones-Taylor-Thornton model of amino acid change


        +-2        
     +-12 
  +--3  +-16       
  |  | 
  |  +-----7        
  | 
  |              +-----18   
  |              | 
  |           +-15     +----12  
  |           |  |  +-11 
  |           |  |  |  +--10    
  |           |  +--9 
  |           |     |  +--------11
  |        +--2     +--6 
  |        |  |        +-------9
  |        |  | 
  |        |  |       +----15  
  |        |  |   +--13 
  |        |  |   |   +--17    
  |     +--1  +--10 
  |     |  |      |  +------13 
  |     |  |      +--5 
  |     |  |         |  +-----19
  |     |  |         +--4 
  |  +-14  |            +-------14
  |  |  |  | 
  |  |  |  +-----------8     
  |  |  | 
  |  |  |  +---3        
 16--8  +-17 
  |  |     |    +------4    
  |  |     +----7 
  |  |          +---5      
  |  | 
  |  +----6        
  | 
  +----1        


remember: this is an unrooted tree!

...

Translation table
-----------------
        1       Trico
        2       Nostoc
        3       Syn6301
        4       Prochl
        5       Syn8102
        6       Thermo
        7       Syn6803
        8       Gloeo
        9       Odont
        10      Porph
        11      Cyanid
        12      Gracil
        13      Nephros
        14      Chlamy
        15      Arabid
        16      Anabae
        17      March
        18      Cyanoph
        19      Chlorel

> plot(tree,type="unrooted",no.margin=TRUE,edge.width=2, cex=0.8)

Rphylip can be obtained from its webpage or from github.

1 comment:

  1. Hi Liam,

    Thanks so much for this integration. Would it be possible to add support for Rivas and Eddy"s dnaml-erate program? It is an extension of the dnaml executable for an F84 + non-reversible insertion/deletions.

    ReplyDelete

Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.