Most recent protein secondary structure prediction methods use sequence alignments to improve prediction quality. We investigate the relationship between the location of secondary structure elements, gaps, and variable residue positions in multiple sequence alignments. We further investigate how these relationships compare with those found in structurally aligned protein families. We show how such associations may be used to improve the prediction of secondary structure elements, using the Quadratic-Logistic method with profiles. Furthermore, we analyze the extent to which the number of homologous sequences influences the quality of prediction. The analysis of variable residue positions shows that, surprisingly, helical regions exhibit greater variability than do coil regions, which are generally thought to be the most common secondary structure elements in loops. However, the correlation between variability and the presence of helices does not significantly improve prediction quality. Gaps are a distinct signal for coil regions. Increasing the coil propensity for those residues occurring in gap regions enhances the overall prediction quality. Prediction accuracy increases initially with the number of homologues, but changes negligibly once the number of homologues exceeds about 14. The alignment quality affects the prediction more than other factors; hence, a careful selection and alignment of even a small number of homologues can lead to significant improvements in prediction accuracy.
Keywords: prediction; protein secondary structure; sequence alignments
For several decades, the protein folding problem has been among the most challenging problems in the biological sciences. In 1994, a bona fide protein structure prediction contest was organized with the aim of assessing the real virtues and defects of "state-of-the-art" methodologies. Analysis of the structures predicted by the contestants (Moult et al., 1995) has generally shown that even the most promising techniques need considerable improvement, and that the protein folding problem should still be considered unresolved. Briefly, ab initio calculations, although promising, are feasible only for small proteins; there have been no major breakthroughs in molecular modeling techniques; and threading techniques need further development. Protein secondary structure prediction was reevaluated and recognized as a useful tool for establishing starting points for tertiary structure calculations.
Early approaches to protein secondary structure prediction from the primary sequence had a prediction accuracy Q3 (percentage of correctly predicted residues in the three states: α-helix, β-strand, and coil) of about 57% (Chou & Fasman, 1978; Garnier et al., 1978). Various later attempts to improve the accuracy (Gibrat et al., 1987; Biou et al., 1988; Levin & Garnier, 1988; Holley & Karplus, 1989; Qian & Sejnowski, 1989; King & Sternberg, 1990; Salzberg & Cost, 1992; Stolorz et al., 1992; Zhang et al., 1992; Munson et al., 1994) with innovative artificial intelligence techniques, such as neural networks, machine learning, nearest neighbors, and combined approaches, have not achieved prediction accuracies greater than 66%. The inclusion of evolutionarily related sequences in the prediction scheme has given a significant boost in prediction accuracy, up to values of about 68-72% (Zvelebil et al., 1987; Levin et al., 1993; Rost & Sander, 1993, 1994; Rost et al., 1994a; Di Francesco et al., 1995; Salamov & Solovyev, 1995).
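The Q3 accuracy quoted above can be illustrated with a minimal sketch. The one-letter state codes (H for α-helix, E for β-strand, C for coil), the function name q3_accuracy, and the example strings are assumptions made here for illustration only; they are not taken from the original work.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Return Q3: the percentage of residues whose predicted state
    matches the observed state, over the three states H, E, and C.

    The H/E/C encoding is an assumption of this sketch.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    allowed = set("HEC")
    if not (set(predicted) <= allowed and set(observed) <= allowed):
        raise ValueError("states must be one of H, E, C")
    # Count positions where the predicted state equals the observed state.
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


# Hypothetical 16-residue example: 13 of 16 states agree.
observed = "CCHHHHHHCCEEEECC"
predicted = "CCHHHHCCCCEEECCC"
print(q3_accuracy(predicted, observed))  # 81.25
```

Q3 counts every residue equally, so it does not distinguish a prediction that misplaces the ends of an element from one that misses the element entirely.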
In general, the suggested explanation for these improvements in prediction accuracy is that sequence alignments of homologous proteins should emulate the structural alignment as closely as possible. Thus, aligned residues, in particular those in the core of proteins, should belong to the same secondary structure elements. Sequence alignments may be utilized to obtain a consensus from the predictions based on each homologous sequence, or they may be used to build sequence profiles at each aligned position. In addition to the
Reprint requests to: Valentina Di Francesco, NIH/DCRT/LSB, Analytical Biostatistics Section, 12 South MSC 5626, Bldg. 12A-Room 2041, Bethesda, Maryland 20892-5626; e-mail: valedf@helix.nih.gov.