Since my last post on Rphylip I've added R interfaces for a few more of PHYLIP programs: consense (Rconsense), proml (proml), and promlk (promlk). The latter two do ML phylogeny estimation (without & with a molecular clock, respectively) from protein sequences.
In addition to these new interfaces, I also added a few different functions for handling protein sequences in R. I should say that I first attempted to establish whether any such methods already existed in other phylogeny packages by running help.search("protein") on all of my installed packages. What I didn't think to do was search for "amino acid". Had I done so, I would have found the function read.aa in the phangorn package, with which at least my Rphylip function read.protein is basically redundant.
The protein handling functions in Rphylip are as follows:
read.protein, an amino acid sequence reading function that can accept data files in formats "fasta" or "sequential". If format="sequential" or format="fasta" with all sequences of the same length then read.protein will store the input data in a matrix; whereas if format="fasta" and sequences have different lengths, then the data will be stored as a list. In both cases the object is assigned the class attribute "proseq" (protein sequence, in case it wasn't obvious enough).
print.proseq is an S3 print method for objects of class "proseq". The print output is designed to mimic that of print.DNAbin in the ape package. (See demo below.)Finally, as.proseq converts protein sequences from other formats. At the moment, it only recognizes & converts from objects of class phyDat (with attr(x,"type")="AA") from the phangorn package.
OK - here is a brief illustration of the Rproml proml interface using an amino acid dataset for chloroplasts from phangorn written to file:
Loading required package: Rphylip
Loading required package: ape
> X<-read.protein("chloroplast.aa")
> X
19 protein sequences in character format stored in a matrix.
All sequences of same length: 5144
Labels: Trico Nostoc Syn6301 Prochl Syn8102 Thermo ...
Amino acid composition:
A C D E F G H I K L
0.087 0.007 0.040 0.047 0.051 0.091 0.029 0.074 0.040 0.102
M N P Q R S T V W Y
0.025 0.036 0.048 0.040 0.052 0.054 0.052 0.073 0.022 0.032
> tree<-Rproml(X,speedier=TRUE,global=FALSE)
...
Amino acid sequence Maximum Likelihood method, version 3.695
...
Adding species:
1. 14
2. 10
3. 11
4. 7
5. 16
6. 19
7. 3
8. 2
9. 9
10. 8
11. 13
12. 12
13. 5
14. 1
15. 15
16. 18
17. 4
18. 17
19. 6
...
Jones-Taylor-Thornton model of amino acid change
+-2
+-12
+--3 +-16
| |
| +-----7
|
| +-----18
| |
| +-15 +----12
| | | +-11
| | | | +--10
| | +--9
| | | +--------11
| +--2 +--6
| | | +-------9
| | |
| | | +----15
| | | +--13
| | | | +--17
| +--1 +--10
| | | | +------13
| | | +--5
| | | | +-----19
| | | +--4
| +-14 | +-------14
| | | |
| | | +-----------8
| | |
| | | +---3
16--8 +-17
| | | +------4
| | +----7
| | +---5
| |
| +----6
|
+----1
remember: this is an unrooted tree!
...
Translation table
-----------------
1 Trico
2 Nostoc
3 Syn6301
4 Prochl
5 Syn8102
6 Thermo
7 Syn6803
8 Gloeo
9 Odont
10 Porph
11 Cyanid
12 Gracil
13 Nephros
14 Chlamy
15 Arabid
16 Anabae
17 March
18 Cyanoph
19 Chlorel
> plot(tree,type="unrooted",no.margin=TRUE,edge.width=2, cex=0.8)
Rphylip can be obtained from its webpage or from github.
No comments:
Post a Comment
Note: due to the very large amount of spam, all comments are now automatically submitted for moderation.