THE MISPRED PROJECT

Project objectives

The main objective of the MisPred project is to identify mispredicted and abnormal genes/proteins primarily from metazoan genomes in order to improve the quality of predictions.
The MisPred approach is based on the principle that a protein-coding gene is likely to be mispredicted if some of the features of the predicted protein conflict with our current knowledge about proteins. The MisPred routines identify suspicious predicted protein sequences that conflict with at least one of these generally accepted rules (dogmas) and therefore could not become correctly folded, functional macromolecules in vivo.
By identifying erroneous protein sequences, the MisPred pipeline informs both the creators of predictive algorithms and experimentalists about the reliability of predictions, and thereby assists in improving the quality of the available datasets. The principles of quality control are illustrated below for the individual MisPred tools.

Principles of quality control

1. Conflict between the presence of extracellular Pfam-A domain(s) in a protein and the absence of appropriate sequence signals.

Conflict 1 is based on the dogma that the subcellular localization of extracellular and transmembrane proteins is defined by the presence of appropriate sequence signals. For the domain-based prediction of the subcellular localization of proteins, only those Pfam-A domain families (Finn et al. 2006) that are exclusively extracellular, exclusively cytoplasmic or exclusively nuclear have been incorporated into the MisPred pipeline. Pfam-A domains that are known not to be restricted to a particular cellular compartment, such as immunoglobulin domains and fibronectin type III domains (i.e. domains that are ‘multilocale’), were not utilized in these analyses. Our domain co-occurrence analyses (Tordai et al. 2005) have identified 166 obligatory extracellular, 115 obligatory cytoplasmic and 126 obligatory nuclear Pfam-A domain families as being restricted to the respective subcellular compartment, the majority of which are also identified as such in the SMART database (Letunic et al. 2004). These Pfam-A domains are listed in Tables 1-3.
This MisPred tool identifies proteins containing extracellular Pfam-A domains that occur exclusively in extracellular proteins or in the extracytoplasmic parts of type I, type II and type III single-pass or multispanning transmembrane proteins, and examines whether the proteins also have a secretory signal peptide, signal anchor or transmembrane segment(s) that could target these domains to the extracellular space. Proteins that contain obligatory extracellular domains but lack a secretory signal peptide, signal anchor and transmembrane segment(s) are considered erroneous, since in the absence of these signals their extracellular domain (usually rich in disulfide bonds) will not be delivered to the extracytoplasmic space where it is properly folded, stable and functional. Mislocalized extracellular domains are likely to be misfolded in the reductive milieu of the cytoplasm, and such proteins are likely to be rapidly degraded by the protein quality control system of the cell.
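
The decision rule of Conflict 1 can be sketched as follows. This is only an illustration under stated assumptions, not the MisPred implementation: the `PredictedProtein` record, its field names and the placeholder family names are hypothetical, and the feature flags are assumed to have been computed beforehand by the signal peptide, transmembrane helix and Pfam-A prediction programs.

```python
from dataclasses import dataclass, field
from typing import Set

# Hypothetical placeholder for the obligatory extracellular Pfam-A families
# (the real list is the one given in Table 1).
EXTRACELLULAR_FAMILIES = frozenset({"EC_family_A", "EC_family_B"})


@dataclass
class PredictedProtein:
    # Pfam-A families detected in the sequence (assumed precomputed).
    pfam_domains: Set[str] = field(default_factory=set)
    has_signal_peptide: bool = False
    has_signal_anchor: bool = False
    n_tm_segments: int = 0


def has_conflict_1(protein, extracellular_families=EXTRACELLULAR_FAMILIES):
    """Flag an obligatory extracellular domain without any targeting signal."""
    has_extracellular = bool(protein.pfam_domains & extracellular_families)
    has_targeting_signal = (
        protein.has_signal_peptide
        or protein.has_signal_anchor
        or protein.n_tm_segments > 0
    )
    return has_extracellular and not has_targeting_signal
```

A protein carrying one of the listed families and none of the three signals is flagged; adding a signal peptide, a signal anchor or a transmembrane segment clears the flag.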

2. Conflict between the presence of extracellular and cytoplasmic Pfam-A domains in a protein and the absence of transmembrane segments.

Conflict 2 is based on the principle that multidomain proteins that contain both obligatory extracellular and obligatory cytoplasmic domains must have at least one transmembrane segment to pass through the cell membrane. The MisPred tool identifies proteins containing both extracellular and cytoplasmic Pfam-A domains and examines whether they also contain transmembrane helices. Proteins that contain both obligatory extracellular and obligatory cytoplasmic domains but lack transmembrane segment(s) separating them are considered suspicious (abnormal and nonviable).
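
A minimal sketch of the Conflict 2 check is given below. It takes the stricter reading of the rule, in which a transmembrane segment must lie between an extracellular and a cytoplasmic domain; whether MisPred compares positions or only tests for the presence of transmembrane helices is not specified here, so the positional test is an assumption. The tuple layouts and the function name are hypothetical.

```python
from typing import List, Tuple

# (location class, start, end); the class is "extracellular" or "cytoplasmic".
Domain = Tuple[str, int, int]
# (start, end) of a predicted transmembrane helix.
Segment = Tuple[int, int]


def has_conflict_2(domains: List[Domain], tm_segments: List[Segment]) -> bool:
    """Flag extracellular/cytoplasmic domain pairs with no helix between them."""
    extracellular = [d for d in domains if d[0] == "extracellular"]
    cytoplasmic = [d for d in domains if d[0] == "cytoplasmic"]
    for _, e_start, e_end in extracellular:
        for _, c_start, c_end in cytoplasmic:
            # Boundaries of the linker region between the two domains.
            left_end = min(e_end, c_end)
            right_start = max(e_start, c_start)
            separated = any(
                left_end < tm_start and tm_end < right_start
                for tm_start, tm_end in tm_segments
            )
            if not separated:
                return True
    return False
```

Proteins with only one of the two domain classes are never flagged by this sketch, because the nested loops then have nothing to compare.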

3. Co-occurrence of nuclear and extracellular Pfam-A domains in a predicted multidomain protein.

Conflict 3 is based upon the rule that protein domains that occur exclusively in the extracellular space and those that occur exclusively in the nucleus do not co-occur in a single multidomain protein (Tordai et al. 2005). The explanation for this rule is that a protein that contains both extracellular and nuclear domains would not be delivered to a compartment where both types of domains would be correctly folded and fully functional. Accordingly, proteins that contain both obligatory extracellular and obligatory nuclear domains are considered abnormal and nonviable since they cannot be delivered to a cellular compartment relevant for both types of the constituent domains.

4. Domain size deviation

Conflict 4 is based on the observation that the protein fold is highly conserved within a domain family; therefore the number of amino acid residues in closely related members of a globular domain family usually falls into a relatively narrow range (Wheelan et al. 2000, Wolf et al. 2007). This phenomenon reflects the fact that insertion or deletion of longer segments into or from structural domains may yield proteins that are unable to fold efficiently into a correct, viable and stable three-dimensional structure (Tordai et al. 2005, Watters et al. 2007, Wolf et al. 2007). MisPred uses only those Pfam-A domain families that have a well-defined and conserved sequence length range, i.e. families whose well-characterized members (in the UniProtKB/Swiss-Prot database) do not deviate from the average domain size by more than 2 SD values. Approximately 85% of all Pfam-A families present in Metazoa turned out to be suitable for this task. Proteins containing domains that consist of a significantly larger or smaller number of residues than closely related members of that domain family (in this conservative approach the cutoff was set to 40% of the actual domain length) may be suspected to be abnormal and nonviable.
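
The two steps described above (selecting suitable families, then testing a predicted domain against the family) can be sketched as follows. Several details are assumptions made for illustration only: the population standard deviation is used, the 40% cutoff is interpreted as a deviation of more than 40% from the mean length of the reference domains, and the function names are hypothetical. The exact definitions used by MisPred may differ.

```python
from statistics import mean, pstdev


def family_is_suitable(reference_lengths, max_sd=2.0):
    """True if no reference domain length deviates from the mean by > max_sd SD."""
    average = mean(reference_lengths)
    sd = pstdev(reference_lengths)  # population SD (an assumption)
    return all(abs(length - average) <= max_sd * sd for length in reference_lengths)


def domain_size_deviates(observed_length, reference_lengths, cutoff=0.40):
    """True if the observed domain is > cutoff (as a fraction) off the mean length."""
    average = mean(reference_lengths)
    return abs(observed_length - average) > cutoff * average
```

For example, with reference lengths of 78, 80, 82 and 80 residues (mean 80), a predicted domain of 40 residues deviates by 40 residues, which exceeds 40% of the mean (32 residues), so it would be flagged; a domain of 75 residues would not.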

5. Chimeric proteins parts of which are encoded by exons located on different chromosomes.

Conflict 5 is based on the rule that a protein is encoded by exons located on a single chromosome. According to this dogma, proteins whose parts are encoded by two or more different genes located on distinct chromosomes are considered abnormal.
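
A minimal sketch of the Conflict 5 test is shown below. It assumes that the protein has already been aligned to the genome and that the alignments have been reduced to the best hit for each disjoint part of the protein; the `(query_start, query_end, chromosome)` tuple layout is a hypothetical simplification, not the native output format of any alignment program.

```python
from collections import defaultdict


def segments_by_chromosome(hits):
    """Group aligned protein segments (start, end) by the chromosome they map to."""
    grouped = defaultdict(list)
    for query_start, query_end, chromosome in hits:
        grouped[chromosome].append((query_start, query_end))
    return dict(grouped)


def has_conflict_5(hits):
    """Flag proteins whose parts map to more than one chromosome."""
    return len(segments_by_chromosome(hits)) > 1
```

Returning the grouped segments as well as the flag makes it possible to report which part of a suspected chimera maps to which chromosome.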

6. Conflict between the presence of secretory signal peptide and cytoplasmic Pfam-A domains in a protein and the absence of transmembrane segments.

Conflict 6 is based on the principle that multidomain proteins that contain both secretory signal peptide and obligatory cytoplasmic domains must have at least one transmembrane segment to pass through the cell membrane. This tool identifies proteins containing both secretory signal peptide and cytoplasmic Pfam-A domains and examines whether they also contain transmembrane helices. Proteins that contain both secretory signal peptide and obligatory cytoplasmic domains but lack transmembrane segment(s) separating them are suspected to be abnormal and nonviable.

7. Conflict between the presence of GPI-anchor in a protein and the absence of appropriate sequence signals.

Conflict 7 is based on the rule that proteins attached to the outer cell membrane via a C-terminal GPI anchor contain a secretory signal peptide that directs them to the extracellular space. Accordingly, proteins that contain a GPI-anchor but lack a secretory signal peptide are considered to be abnormal.

8. Co-occurrence of GPI-anchor and cytoplasmic Pfam-A domains in a protein.

Conflict 8 is based on the dogma that in the case of GPI-anchored proteins the whole protein resides in the extracellular space; therefore it may contain only extracellular domains but not cytoplasmic domains. Accordingly, proteins that contain a GPI-anchor and obligatory cytoplasmic domains are considered to be abnormal.

9. Co-occurrence of GPI-anchor and nuclear Pfam-A domains in a protein.

Conflict 9 is based on the dogma that in the case of GPI-anchored proteins the whole protein resides in the extracellular space; therefore it may contain only extracellular domains but not nuclear domains. Accordingly, proteins that contain a GPI-anchor and obligatory nuclear domains are considered to be abnormal.

10. Co-occurrence of GPI-anchor and transmembrane segments in a protein.

Conflict 10 is based on the principle that in the case of GPI-anchored proteins the whole protein resides in the extracellular space; therefore it may contain only extracellular domains but not transmembrane helices. Accordingly, proteins that contain a GPI-anchor and transmembrane helices are considered to be abnormal.
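
The four GPI-anchor rules (Conflicts 7 to 10) share the same precondition and can be summarized in one short sketch. The function name and the boolean inputs are hypothetical; the inputs are assumed to come from the GPI-anchor, signal peptide, Pfam-A and transmembrane helix predictions, and the sketch says nothing about how MisPred combines or filters those predictions.

```python
def gpi_conflicts(has_gpi_anchor, has_signal_peptide,
                  has_cytoplasmic_domain, has_nuclear_domain, n_tm_segments):
    """Return the numbers of the GPI-related conflicts (7-10) that apply."""
    if not has_gpi_anchor:
        return []  # none of the four rules applies without a GPI anchor
    conflicts = []
    if not has_signal_peptide:
        conflicts.append(7)   # GPI anchor without secretory signal peptide
    if has_cytoplasmic_domain:
        conflicts.append(8)   # GPI anchor with obligatory cytoplasmic domain
    if has_nuclear_domain:
        conflicts.append(9)   # GPI anchor with obligatory nuclear domain
    if n_tm_segments > 0:
        conflicts.append(10)  # GPI anchor with transmembrane helices
    return conflicts
```

A single protein can violate several of these rules at once, which is why the sketch returns a list rather than a single flag.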

11. Domain architecture deviation

Conflict 11 focuses on mispredictions that affect the domain organization of proteins. The rationale behind this tool is that changes in domain architecture are relatively rare; therefore, if we see an apparently ‘novel’ domain architecture in closely related organisms, then this is more likely to reflect error(s) in gene prediction than true innovation. (This tool will be included in the next release of MisPred.)

Methods

Each MisPred routine is based upon generally accepted rules about the properties of protein-coding genes and correctly folded, functionally competent protein molecules. Each routine combines reliable bioinformatic methodologies, as well as in-house programs, to analyze protein sequences. The obligatory extracellular, cytoplasmic and nuclear Pfam-A domain families used for Conflicts 1, 2 and 3 are listed in Tables 1-3. The Pfam-A domain families suitable for the study of domain integrity (Conflict 4) are shown in Table 4. Secretory signal peptides (PrediSi, SignalP, Phobius), transmembrane helices (TMHMM, Phobius), GPI-anchors (DGPI) and Pfam-A domains (hmmscan) are identified by standard methods. Sequence alignments are carried out using BLAST. For Conflict 5, protein sequences are aligned onto the genome using the BLAT program.

A detailed description of the constituents, tool logic and performance (specificity and sensitivity) of the various MisPred tools is given in Nagy and Patthy (2013).

The results of the latest analyses are stored in the MisPred database and are accessible through the query page under Search MisPred. The database can be queried by protein accession number or protein sequence, and the results can be filtered by species name, type of conflict, source database, or any combination of these features. The query page also provides links to various databases and resources that give further information about the proteins and their domains. The results of the analyses of previous database versions are also accessible by ticking the Search in archive checkbox.

The annotations relating to EnsEMBL proteins can also be viewed through EnsEMBL using the Distributed Annotation System.

We have extended our analyses to the Gencode sequences. The results of these analyses are found under Gencode info.

We have analyzed protein sequences of the Swiss-Prot and TrEMBL sections of the UniProt Knowledgebase (UniProtKB), and protein sequences of the EnsEMBL and NCBI/RefSeq databases. The results of these analyses on the current database versions are summarized in Statistics.

Analyze your sequence

If the protein sequence you are interested in is not in the MisPred database, you can analyze it under Analyze Your Sequence.

References

 

  1. Finn RD, Mistry J, Schuster-Bockler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A (2006) Pfam: clans, web tools and services. Nucleic Acids Research 34:D247-51.
  2. Letunic I, Copley R, Schmidt S, Ciccarelli F, Doerks T, Schultz J, Ponting C, Bork P (2004) SMART 4.0: towards genomic data integration. Nucleic Acids Research 32:D142-4.
  3. Tordai H, Nagy A, Farkas K, Bányai L, Patthy L (2005) Modules, multidomain proteins and organismic complexity. The FEBS Journal 272:5064-78. http://www.ncbi.nlm.nih.gov/pubmed/16176277
  4. Watters AL, Deka P, Corrent C, Callender D, Varani G, Sosnick T, Baker D (2007) The highly cooperative folding of small naturally occurring proteins is likely the result of natural selection. Cell 128:613-24.
  5. Wheelan S, Marchler-Bauer A, Bryant S (2000) Domain size distributions can predict domain boundaries. Bioinformatics 16:613-618.
  6. Wolf Y, Madej T, Babenko V, Shoemaker B, Panchenko A (2007) Long-term trends in evolution of indels in protein sequences. BMC Evolutionary Biology 7:19.
  7. Nagy A, Patthy L (2013) MisPred: a resource for identification of erroneous protein sequences in public databases. Database 2013:bat053. doi:10.1093/database/bat053