V. Di Francesco, P. J. Munson and J. Garnier*
Analytical Biostatistics Section, Laboratory of Structural Biology,
Division of Computer Research and
Technology and *Fogarty International Center, National Institutes of
Health, Bethesda, MD 20892-5626
valedf@helix.nih.gov, munson@helix.nih.gov,
garnier@darwin.bu.edu
Using a new database of 20 proteins not included in any of the previously used training datasets, we have incorporated multiple alignment information from homologous proteins into two well-characterized prediction methods, COMBINE (a jury method) and the Q-L (or quadratic-logistic) method. The increase in accuracy from the use of related proteins is similar for both methods (5.8% and 6.3%, respectively), yielding a per-residue prediction accuracy (Q3) of 68.7% and 69.0%, respectively, for a three-state prediction. Most of the improvement came from consideration of averaging, profiling or consensus predictions. Of this improvement, a small amount (0.5%) came from recognition that "gap-permissive" positions in the alignment are most frequently in the coil state. Our finding is consistent with the hypothesis of a common secondary structure for the aligned family, and with improved accuracy being due to reduced noise in the prediction.
Introduction
The present databases of amino acid sequences contain a relatively
large number of homologous sequences of proteins that are
evolutionarily related and usually perform the same function. This
means also that these homologous proteins share the same overall
three dimensional structure. When one of the members of this
protein family has a known structure, we may apply the principle of
protein modeling by homology [1]. If no structure is known for
that family, some structural data can nevertheless be extracted from
the amino acid sequences concerning the secondary structure,
described as α-helix (H), β-strand or extended structure (E)
and non-periodic structure, coil (C). When a single sequence is used,
the accuracy of the best existing automated methods is 63-65% of
correctly predicted residues in the three states, H, E and C [see
reviews in 2-3]. However if the sequences of several members of the
same family are known, it has been shown that secondary structure
predictions can be significantly improved [4-6]. These enhancements
in accuracy rely on multiple alignments of the homologous sequences
which currently must have at least 25% of identical residues [5], a
value corresponding to the threshold used for sequences longer than
80 residues exhibiting similar secondary structures [7]. We do not
imply that lower percentages of identity between sequences might
not also correspond to similar secondary and tertiary structures, but
rather that the present alignment methods are not able to distinguish
such proteins from others having a very different fold. Furthermore,
the quality of alignments between such distant sequences degrades
with evolutionary distance. The treatment of the multiple alignment
of sequences differs among authors [4-6], but generally, these
methods account for the individual prediction of secondary structure
for each aligned sequence and at each position in the alignment. This
is done through a profile [6] or through a consensus prediction [4, 5].
As the improvement brought to the prediction by the consensus
prediction is very significant for the GOR or SIMPA algorithms [4, 5],
we wanted to address the question whether more advanced and
accurate prediction algorithms, such as COMBINE [8] or the quadratic-
logistic algorithm [9], can also benefit from the multiple alignment
and how this compares with the profile algorithm used by others and
data already obtained. For this we used a database of 20 proteins of
high X-ray resolution, not bearing significant homology to any of the
proteins used in developing the parameters for these different
algorithms.
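
To make the consensus idea and the accuracy measure concrete, the following is a minimal sketch only. It is not the COMBINE, Q-L, GOR or SIMPA implementation; the function names, the simple majority vote, and the fallback to coil for ties and all-gap columns are assumptions made for illustration. Each input string is assumed to hold the per-sequence prediction already mapped onto the alignment columns, with '-' where that sequence has a gap.

```python
from collections import Counter

STATES = "HEC"  # helix, strand (extended), coil


def consensus_prediction(aligned_predictions):
    """Majority vote per alignment column over per-sequence predictions.

    aligned_predictions: equal-length strings over 'H', 'E', 'C', using '-'
    where a sequence has a gap. Ties and all-gap columns fall back to coil;
    that fallback is an assumption of this sketch, not a rule from the paper.
    """
    length = len(aligned_predictions[0])
    if any(len(p) != length for p in aligned_predictions):
        raise ValueError("all aligned predictions must have the same length")
    consensus = []
    for column in zip(*aligned_predictions):
        # Count only real state symbols; gaps do not vote.
        counts = Counter(s for s in column if s in STATES)
        if not counts:
            consensus.append("C")
            continue
        best = max(counts.values())
        winners = [s for s in STATES if counts[s] == best]
        consensus.append(winners[0] if len(winners) == 1 else "C")
    return "".join(consensus)


def q3(predicted, observed):
    """Percentage of residues whose predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    toy_predictions = ["HHHEEC", "HHCEEC", "HCC-EC"]  # hypothetical toy data
    consensus = consensus_prediction(toy_predictions)
    print(consensus, round(q3(consensus, "HHHEEC"), 1))
```

A profile-based treatment would instead average per-state scores across the aligned sequences before choosing a state; the vote above is the simplest stand-in for that family of approaches.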
   
We also wished to investigate the role of gaps in homologous
sequence alignment in possibly augmenting prediction accuracy.
Others have shown that the presence of gaps in the alignment may be
associated with coil propensity, and it is reasonable to assume that
the exposed loop regions of proteins are more permissive of
mutational insertions and deletions, and that these loop regions
commonly assume random coil conformation [10].
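
As a rough illustration of how gap information might be used, the sketch below flags alignment columns in which at least one sequence has a gap and sets the predicted state to coil there. Both the definition of a "gap-permissive" column and the hard override to coil are assumptions made for this example; the text above states only that such positions tend to be in the coil state, so a real method would more plausibly adjust a coil score than force the state. All names are hypothetical.

```python
def gap_permissive_columns(alignment, gap_chars="-."):
    """Return one boolean per alignment column: True if any sequence has a gap.

    alignment: equal-length aligned sequences (strings). The "any gap"
    criterion is an assumption of this sketch.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [any(res in gap_chars for res in column) for column in zip(*alignment)]


def apply_gap_rule(prediction, alignment):
    """Set the predicted state to coil ('C') at gap-permissive columns.

    prediction: one state per alignment column ('H', 'E' or 'C').
    """
    flags = gap_permissive_columns(alignment)
    if len(prediction) != len(flags):
        raise ValueError("prediction must have one state per alignment column")
    return "".join("C" if flag else state for state, flag in zip(prediction, flags))


if __name__ == "__main__":
    toy_alignment = ["MKT-LV", "MKTALV", "MR--LV"]  # hypothetical toy data
    print(apply_gap_rule("HHHHEE", toy_alignment))
```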
U.S. Government Work Not Protected by U.S. Copyright
Proceedings of the 28th Annual Hawaii International Conference on System Sciences, 1995. Biotechnology Computing, Vol. 5, Eds: L. Hunter & B. D. Shriver, IEEE Computer Society Press, Los Alamitos, CA