Parsing Uniprot Files
Uniprot is a comprehensive data source about proteins. It contains protein sequences and
information about proteins in a large number of species. As computer scientists, you can
think of a protein sequence as a string over the alphabet of amino acids. For example, the
sequence of one component of the human hemoglobin protein (its alpha subunit) is:
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
The id given by Uniprot to this protein is P69905. This id is called an accession number
(AC). The Uniprot knowledgebase is available in a variety of formats, including a simple
flat file format that you are going to parse. Part of the entry for the hemoglobin alpha globin is
as follows:
ID HBA_HUMAN Reviewed; 142 AA.
AC P69905; P01922; Q1HDT5; Q3MIF5; Q53F97; Q96KF1; Q9NYR7; Q9UCM0;
DT 21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT 23-JAN-2007, sequence version 2.
DT 28-JUL-2009, entry version 75.
DE RecName: Full=Hemoglobin subunit alpha;
DE AltName: Full=Hemoglobin alpha chain;
DE AltName: Full=Alpha-globin;
GN Name=HBA1;
GN and
GN Name=HBA2;
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
OC Catarrhini; Hominidae; Homo.
OX NCBI_TaxID=9606;
....
SQ SEQUENCE 142 AA; 15258 MW; 15E13666573BBBAE CRC64;
MVLSPADKTN VKAAWGKVGA HAGEYGAEAL ERMFLSFPTT KTYFPHFDLS HGSAQVKGHG
KKVADALTNA VAHVDDMPNA LSALSDLHAH KLRVDPVNFK LLSHCLLVTL AAHLPAEFTP
AVHASLDKFL ASVSTVLTSK YR
//
Assignment 1
Things you should know about this format:
Each type of information is specified by a two-letter code at the beginning of the line.
The accession number for this protein is found in the AC line (and it does not include
the semicolon). Note that the protein has several accession numbers, and the first one
is the primary accession number.
The sequence is found in the lines following the SQ line until the end-of-entry
terminator (//). Spaces are not part of the sequence, and are there only for human
readability.
Each entry is terminated by //.
You can find the Uniprot entry for this protein at
Hemoglobin subunit alpha - Homo sapiens (Human).
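The format rules above can be sketched as three small helper methods. This is only an illustration under stated assumptions: the class name LineRulesSketch and the method names are hypothetical, and primaryAccession assumes it is handed the first AC line of an entry.

```java
class LineRulesSketch {
    // Returns the two-letter code at the start of a line ("" if the line is shorter).
    static String lineCode(String line) {
        return line.length() >= 2 ? line.substring(0, 2) : "";
    }

    // Returns the first accession number on an AC line, without the semicolon.
    // Assumption: the argument is the first AC line of the entry.
    static String primaryAccession(String acLine) {
        String rest = acLine.substring(2).trim();   // drop the "AC" code
        int semi = rest.indexOf(';');
        return semi >= 0 ? rest.substring(0, semi).trim() : rest;
    }

    // Removes the spaces that are only there for human readability.
    static String stripSpaces(String sequenceLine) {
        return sequenceLine.replace(" ", "");
    }
}
```

For the entry shown earlier, primaryAccession applied to the AC line would give P69905, and stripSpaces applied to each sequence line would give the pieces to concatenate.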
The Uniprot knowledge base is also available in XML, which is a modern, generic language for
representing data. Your task is to parse a Uniprot flat file containing entries for an unknown
number of proteins (currently the database contains information on 495880 proteins coming
from 5208 species). For each protein you will need to extract its primary accession number
and sequence. The extracted data then needs to be written to another file, which
contains the accession numbers and sequences and should have the following format:
>header1
sequence1
>header2
sequence2
....
>headerN
sequenceN
This format is called Fasta (after the name of a program written in the 1980s which used
this format). Here's a description of the format:
FASTA format - Wikipedia, the free encyclopedia.
In our case each header in the Fasta file needs to contain the primary accession number of
the protein.
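As a rough illustration of the output format, the following sketch writes one header line and one sequence line per protein. The class name FastaSketch and its method names are hypothetical, and it assumes the accession numbers and sequences are already available as two arrays of equal length.

```java
import java.io.FileNotFoundException;
import java.io.PrintWriter;

class FastaSketch {
    // Builds one Fasta record: a ">" header line followed by the sequence line.
    static String record(String accession, String sequence) {
        return ">" + accession + "\n" + sequence + "\n";
    }

    // Writes all records to the named file, one after another.
    static void writeFasta(String fileName, String[] accessions, String[] sequences)
            throws FileNotFoundException {
        PrintWriter writer = new PrintWriter(fileName);
        for (int i = 0; i < accessions.length; i++) {
            writer.print(record(accessions[i], sequences[i]));
        }
        writer.close();   // flushes the output and releases the file
    }
}
```

Each sequence is written on a single line here, which matches the format shown above; wrapping long sequences across several lines is a separate design choice that this sketch does not attempt.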
Specifications, notes, and hints
The name of the source code file must be Parser.java. The program should be
executable as: java Parser uniprotFile fastaFile where uniprotFile is the file name
for the input file containing Uniprot entries, and fastaFile is the output file in Fasta
format.
Your program needs to have at least two methods in addition to the main method:
a method for extracting the accession numbers and sequences out of the Uniprot file.
This method needs to return the data as two arrays (think how to do that, since a
method can only have a single return value). Also, you don't know in advance how
many proteins are represented in the Uniprot file, so in order to define arrays of the
right size you need to count the number of entries in the file. This is the second method
your program needs to have. Java has data structures that can grow dynamically, but
you shouldn't use them. As a general rule, only use Java features that were covered in
class.
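One possible shape for these two methods is sketched below; it is not the required solution. The class name TwoPassSketch and the method names are hypothetical. The sketch reads from a Scanner and returns the two arrays wrapped in a single array of arrays, which is only one of several ways to return two arrays from one method. It also assumes that Scanner was covered in class.

```java
import java.util.Scanner;

class TwoPassSketch {
    // First pass: every entry ends with a "//" line, so count those lines.
    static int countEntries(Scanner in) {
        int count = 0;
        while (in.hasNextLine()) {
            if (in.nextLine().startsWith("//")) {
                count++;
            }
        }
        return count;
    }

    // Second pass: fill two parallel arrays of the right size.
    // Result: index 0 holds the accession numbers, index 1 the sequences.
    static String[][] extract(Scanner in, int numEntries) {
        String[] accessions = new String[numEntries];
        String[] sequences = new String[numEntries];
        int entry = 0;
        boolean inSequence = false;   // true between the SQ line and "//"
        String sequence = "";
        while (in.hasNextLine() && entry < numEntries) {
            String line = in.nextLine();
            if (line.startsWith("//")) {
                sequences[entry] = sequence;
                entry++;
                sequence = "";
                inSequence = false;
            } else if (inSequence) {
                sequence += line.replace(" ", "");
            } else if (line.startsWith("AC") && accessions[entry] == null) {
                // Only the first accession on the first AC line is kept.
                String rest = line.substring(2).trim();
                int semi = rest.indexOf(';');
                accessions[entry] = semi >= 0 ? rest.substring(0, semi) : rest;
            } else if (line.startsWith("SQ")) {
                inSequence = true;
            }
        }
        return new String[][] { accessions, sequences };
    }
}
```

With a real file, each pass would need its own Scanner (for example, new Scanner(new File(name)) opened twice), because a Scanner cannot be rewound after the counting pass has consumed the input.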