Ravi On Computational and Technology Blogging: FASTA format of biological sequence representation and a small Perl script to create the format

The FASTA format can represent either a single sequence or several sequences in one file. A series of single-sequence FASTA entries concatenated together forms a multisequence file. The best source for a description of the FASTA/Pearson format is the documentation of the FASTA suite of programs. It can be downloaded with any free distribution of FASTA (see fasta20.doc, fastaVN.doc or fastaVN.me, where VN is the version number).
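To make the multisequence idea concrete, here is a minimal Perl sketch that splits FASTA text into individual records. The sub name `read_fasta_records` and the two sample entries are made up for illustration, and the sketch assumes every record starts with a ">" line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split FASTA text into records.
# Returns a list of hash refs: { header => ..., sequence => ... }.
sub read_fasta_records {
    my ($text) = @_;
    my @records;
    foreach my $line (split /\r?\n/, $text) {
        next if $line =~ /^\s*$/;              # skip blank lines
        if ($line =~ /^>(.*)/) {               # a ">" line starts a new record
            push @records, { header => $1, sequence => '' };
        }
        elsif (@records) {                     # sequence line of the current record
            $records[-1]{sequence} .= $line;
        }
    }
    return @records;
}

# Two made-up single-sequence entries concatenated into one piece of text.
my $text = ">seq1 first example\nACGTAC\nGTAA\n>seq2 second example\nMFSGPS\n";
foreach my $rec (read_fasta_records($text)) {
    printf "%s: %d residues\n", $rec->{header}, length $rec->{sequence};
}
```

Each record keeps its description line and its sequence joined into one string, so the line breaks of the input file no longer matter.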

A sequence in FASTA format is represented as a series of lines, each of which should be no longer than 120 characters and usually does not exceed 80 characters. This was probably to allow preallocation of fixed-size line buffers in software; at the time, most users relied on DEC VT (or compatible) terminals, which could display 80 or 132 characters per line. Most people preferred the larger font of the 80-character mode, so it became the recommended practice to use 80 characters or fewer (often 70) per line in FASTA files.
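The line-length convention can be applied with a few lines of Perl. This is only a sketch: the sub name `wrap_sequence` and the default width of 70 are assumptions chosen for illustration, not part of any FASTA tool.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: break a sequence string into lines of at most $width characters.
sub wrap_sequence {
    my ($sequence, $width) = @_;
    $width ||= 70;                                   # assumed default line width
    my @lines = $sequence =~ /(.{1,$width})/g;       # chunks of up to $width characters
    return join("\n", @lines) . "\n";
}

# 160 letters are printed as lines of 70, 70 and 20 characters.
print wrap_sequence('ACGT' x 40, 70);
```

The regular expression takes as many characters as fit on a line, so only the last line can be shorter than the chosen width.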

The first line in a FASTA file begins with either a ">" (greater-than) symbol or a ";" (semicolon) and was originally treated as a comment. Subsequent lines beginning with a semicolon are ignored by the software. Since the first line was the only comment in use, it quickly came to hold a brief description of the sequence, often beginning with a unique library accession number. Over time it became routine to always use ">" for the first line and not to use ";" comments (which would otherwise be ignored).
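The roles of ">" and ";" lines can be sketched in Perl as follows. The sub name `classify_line` is hypothetical, and treating the first whitespace-delimited word of the description line as the identifier is a common convention assumed here, not a rule stated in the format description.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: classify one line of a FASTA file.
# Returns ('header', $id, $description), ('comment', $text) or ('sequence', $line).
sub classify_line {
    my ($line) = @_;
    chomp $line;
    if ($line =~ /^>\s*(\S*)\s*(.*)$/) {
        return ('header', $1, $2);      # first word is treated as the identifier
    }
    if ($line =~ /^;(.*)$/) {
        return ('comment', $1);         # old-style ";" comment, usually skipped
    }
    return ('sequence', $line);
}

my @parsed = classify_line(">gi|18110536|ref|NP_476597.2| Son of sevenless [Drosophila melanogaster]\n");
print "id: $parsed[1]\n";
print "description: $parsed[2]\n";
```

A reader built on this would start a new record on every 'header' result, skip 'comment' results, and append 'sequence' results to the current record.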

Following the initial line (used for a unique description of the sequence) is the actual sequence itself, in standard one-letter code. Anything other than a valid code is ignored (including spaces, tabs, asterisks, etc.). Originally it was also common to end the sequence with a "*" (asterisk) character (by analogy with PIR-formatted sequences) and, for the same reason, to leave a blank line between the description and the sequence.
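Ignoring everything that is not a sequence code could look like the following Perl sketch. The sub name `clean_sequence` is made up, and "valid code" is approximated here as any ASCII letter; a stricter check would use the exact nucleotide or amino-acid alphabet.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep only letters from raw sequence text.
# Assumption: any ASCII letter counts as a valid one-letter code.
sub clean_sequence {
    my ($raw) = @_;
    (my $clean = $raw) =~ s/[^A-Za-z]//g;   # drops spaces, tabs, digits, "*" and newlines
    return $clean;
}

print clean_sequence("MFSG PSGH\tAHTI\nSYGG*"), "\n";   # prints MFSGPSGHAHTISYGG
```

Because the trailing "*" and any blank lines are removed in the same pass, old PIR-style input needs no special handling.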

For example, this is the protein sequence of the Son of sevenless (SOS) protein, which is involved in a signalling pathway in the fruit fly:

>gi|18110536|ref|NP_476597.2| Son of sevenless [Drosophila melanogaster]
MFSGPSGHAHTISYGGGIGLGTGGGGGSGGSGSGSQGGGGGIGIGGGGVAGLQDCDGYDFTKCENAARWR
GLFTPSLKKVLEQVHPRVTAKEDALLYVEKLCLRLLAMLCAKPLPHSVQDVEEKVNKSFPAPIDQWALNE
AKEVINSKKRKSVLPTEKVHTLLQKDVLQYKIDSSVSAFLVAVLEYISADILKMAGDYVIKIAHCEITKE
DIEVVMNADRVLMDMLNQSEAHILPSPLSLPAQRASATYEETVKELIHDEKQYQRDLHMIIRVFREELVK
IVSDPRELEPIFSNIMDIYEVTVTLLGSLEDVIEMSQEQSAPCVGSCFEELAEAEEFDVYKKYAYDVTSQ
ASRDALNNLLSKPGASSLTTAGHGFRDAVKYYLPKLLLVPICHAFVYFDYIKHLKDLSSSQDDIESFEQV
QGLLHPLHCDLEKVMASLSKERQVPVSGRVRRQLAIERTRELQMKVEHWEDKDVGQNCNEFIREDSLSKL
GSGKRIWSERKVFLFDGLMVLCKANTKKQTPSAGATAYDYRLKEKYFMRRVDINDRPDSDDLKNSFELAP
RMQPPIVLTAKNAQHKHDWMADLLMVITKSMLDRHLDSILQDIERKHPLRMPSPEIYKFAVPDSGDNIVL
EERESAGVPMIKGATLCKLIERLTYHIYADPTFVRTFLTTYRYFCSPQQLLQLLVERFNIPDPSLVYQDT
GTAGAGGMGGVGGDKEHKNSHREDWKRYRKEYVQPVQFRVLNVLRHWVDHHFYDFEKDPMLLEKLLNFLE
HVNGKSMRKWVDSVLKIVQRKNEQEKSNKKIVYAYGHDPPPIEHHLSVPNDEITLLTLHPLELARQLTLL
EFEMYKNVKPSELVGSPWTKKDKEVKSPNLLKIMKHTTNVTRWIEKSITEAENYEERLAIMQRAIEVMMV
MLELNNFNGILSIVAAMGTASVYRLRWTFQGLPERYRKFLEECRELSDDHLKKYQERLRSINPPCVPFFG
RYLTNILHLEEGNPDLLANTELINFSKRRKVAEIIGEIQQYQNQPYCLNEESTIRQFFEQLDPFNGLSDK
QMSDYLYNESLRIEPRGCKTVPKFPRKWPHIPLKSPGIKPRRQNQTNSSSKLSNSTSSVAAAAAASSTAT
SIATASAPSLHASSIMDAPTAAAANAGSGTLAGEQSPQHNPHAFSVFAPVIIPERNTSS

Let's look at a Perl script that reads a string from a file (the string represents a nucleotide and/or protein sequence) and prints it in FASTA format.

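#!/usr/bin/perl
use strict;
use warnings;

my $file = $ARGV[0];      # input file name from the first argument
die "Usage: $0 <sequence_file>\n" unless defined $file;

open(my $fh, '<', $file) or die "Cannot open $file: $!\n";   # open the file for reading
my @file_array = <$fh>;   # read all lines of the file into one array
close($fh);               # close the file

my @temp;
foreach my $line (@file_array)
{
     chomp $line;         # strip the trailing newline
     push @temp, $line;
}
my $sequence = join('', @temp);        # one continuous sequence string

print ">Sequence1\n";     # description (header) line
my $counter = 0;
my @seqarray = split(//, $sequence);   # one array element per residue
for (my $i = 0; $i < @seqarray; $i++)
{
       print $seqarray[$i];
       $counter++;
       if ($counter == 60)             # wrap after 60 characters
       {
              print "\n";
              $counter = 0;
       }
}
print "\n" if $counter > 0;            # terminate the last, partial line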


Sunday, February 20, 2011
