Bioinformatics
Bioinformatics applies software and databases to genomics and proteomics. Protein databases are populated with the results of classical protein research, as well as predictions computed from genomic data. Bioinformatics software also controls instruments and analyzes their results.
Proteomics Databases
Protein databases are at the heart of proteomics. These databases are usually freely available to everyone. The National Center for Biotechnology Information (NCBI) maintains a number of these databases. The figure on the left shows NCBI's interconnected databases. The Molecular Information Agent and DBGET are other graphical portals showing interconnected protein and gene databases.
A more comprehensive list of specialized proteomic and genomic databases is published each year by Nucleic Acids Research, along with articles describing each one.
Varieties of protein databases:
- General information about proteins — the most famous ones are SWISS-PROT, IPI, PIR, and NCBI. These are combined in UniProt, the world's most comprehensive catalog of information about proteins. A search on the Bioinformatic Harvester returns all that is known about a protein and its related gene in fifteen different databases.
- Protein sequence databases - amino acid sequences in FASTA format, used with mass spectrometry measurements to identify proteins. These databases are often a subset of the general protein databases listed above: SWISS-PROT, IPI, PIR, NCBI and UniProt.
- Proteomics databases — data collected in proteomics experiments such as the protein identifications in PeptideAtlas, the Open Proteomics Database, and the Global Proteome Machine. There are also databases of experimental 2D gels.
- Three-dimensional structures of proteins — the most well-known is PDB.
- Protein-protein interactions - which proteins interact, and with which partners.
- Gene Ontology - a database of terms that classify protein functions, processes and subcellular locations. Go Browser, FatiGo and Gene Finder are web sites that link proteins to their Gene Ontology functions.
- Databases that relate proteins and genes to diseases - the most well-known is OMIM.
- PubMed - the database of abstracts of all journal articles relevant to biology, medicine, and proteomics.
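The FASTA format mentioned in the list above is plain text: each record starts with a header line beginning with ">", followed by one or more lines of sequence. The following is a minimal sketch of reading such records in Python. The function name `parse_fasta` and the two example records are hypothetical, made up for illustration, and are not taken from any of the databases named here.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the leading '>') to its full sequence."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped; join them back together.
            records[header] += line
    return records


# Hypothetical records: identifiers and sequences are invented.
example = """>prot1 hypothetical protein one
MKTAYIAK
QRQISFVK
>prot2 hypothetical protein two
GAVLK
"""

records = parse_fasta(example.splitlines())
for name, sequence in records.items():
    print(name, len(sequence))
```

Real database headers carry accession numbers and descriptions in database-specific layouts, so a production parser would usually split the header further.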
Bioinformatics Programs
Bioinformatics software for proteomics has three main parts:
- Mass spectrum interpretation identifies proteins from their measured spectra (discussed in more depth below).
- Computational biology is largely aimed at computing the three-dimensional shape of a protein from its one-dimensional amino acid sequence. A protein's 3D structure determines its function.
- Sequence comparison seeks to understand how proteins work by seeing what other proteins are similar to them. The standard program for comparing sequences is BLAST, or variations such as PSI-BLAST. For each protein in the NCBI database, BLink lists all other proteins matched by a pre-computed BLAST search.
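To give a feel for what sequence comparison computes, here is a small sketch of local alignment scoring in the Smith-Waterman style. This is not BLAST itself: BLAST uses fast heuristics to approximate this kind of local alignment across a whole database. The scoring values (match +2, mismatch -1, gap -1) are assumptions chosen for simplicity; real tools typically use an amino acid substitution matrix instead.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a fresh local alignment
                prev[j - 1] + step, # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# Two made-up sequences that share the stretch "ACDE".
print(local_alignment_score("XXACDEYY", "ZZACDEWW"))
```

A higher score means the two sequences share a longer or better-conserved region, which is the signal used to suggest that two proteins may be related.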
Protein Identification Software
As the diagram above shows, protein identification using tandem mass spectrometry requires software at a number of steps.
- Instrument control software runs the tandem mass spectrometer. In particular, it decides when a peak from the first component is good enough to examine in detail, and then switches to the second component to analyze the peptide fragments. Dynamic exclusion is the technique that prevents the second component from examining the same peptide repeatedly.
- Peak-finding software takes the raw data from the tandem mass spectrometer and outputs a peak list. Each peak represents the measured abundance of peptide ions at a given mass-to-charge ratio.
- Search engines match the spectra against a protein sequence (FASTA) database. SEQUEST and Mascot are the leading search engines. SEQUEST Sorcerer is a hardware-accelerated version of SEQUEST. You can try Mascot on the web. If you need test data to try with Mascot, look at PeptideAtlas. A newer alternative to these search engines is Phenyx (which you can also try on the web), and free alternatives are X! Tandem and OMSSA. A web site providing access to several proteomics software tools, including search engines, is ProteomeCommons.
- An alternative to searching the spectra against a protein sequence database is to first determine, from each spectrum, which amino acid sequence could have produced it, and then to search the FASTA database with that sequence. This approach is called de novo sequencing. Some de novo programs are: Peaks, a commercially supported program; Lutefisk, an open source program; and PepNovo, a web-based program.
- Yet another approach is to find sequences of three to five amino acids using partial de novo sequencing, which is simpler than de novo sequencing the whole peptide. Then you can search the FASTA database with these sequence tags. InsPecT is a web-based tag search program.
- Each search alternative identifies proteins, but with some degree of uncertainty. These results can be improved by using statistics to maximize the number of proteins identified while minimizing the false identifications. This analysis is available with Proteome Software's Scaffold program.
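The database search step in the list above can be illustrated with a much simplified sketch: cut each database protein into peptides, compute each peptide's mass, and keep the peptides whose mass matches a measured value. Several things here are assumptions rather than a description of SEQUEST or Mascot: the sketch assumes trypsin digestion (cleaving after K or R unless followed by P, with no missed cleavages), it compares only neutral monoisotopic peptide masses, and the function names, the tolerance and the two-protein database are hypothetical. Real search engines go further and score the fragment peaks of each spectrum against every candidate.

```python
from typing import Dict, List, Tuple

# Standard monoisotopic residue masses in daltons, rounded to five decimals.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide accounts for the free termini


def tryptic_peptides(sequence: str) -> List[str]:
    """Cleave after K or R unless the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # trailing peptide, if any
    return peptides


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of a peptide."""
    return sum(RESIDUE_MASS[residue] for residue in peptide) + WATER


def find_candidates(observed_mass: float, database: Dict[str, str],
                    tolerance: float = 0.01) -> List[Tuple[str, str]]:
    """Return (protein, peptide) pairs whose mass is within the tolerance."""
    hits = []
    for protein, sequence in database.items():
        for peptide in tryptic_peptides(sequence):
            if abs(peptide_mass(peptide) - observed_mass) <= tolerance:
                hits.append((protein, peptide))
    return hits


# Hypothetical two-protein database with invented sequences.
database = {"protA": "MKTAYIAKGR", "protB": "GGKAAR"}
observed = peptide_mass("TAYIAK")  # stand-in for a measured peptide mass
hits = find_candidates(observed, database)
print(hits)
```

Because many different peptides can share nearly the same mass, a mass match alone only narrows the candidates. This is one reason the identifications carry the uncertainty described above.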