Proceedings of the 28th Annual Hawaii International Conference on System Sciences -- 1995


Use of Multiple Alignments in Protein Secondary Structure Prediction

V. Di Francesco, P. J. Munson and J. Garnier*

Analytical Biostatistics Section, Laboratory of Structural Biology, Division of Computer Research
Technology and *Fogarty International Center, National Institutes of Health, Bethesda, MD 20892-5626
valedf@helix.nih.gov, munson@helix.nih.gov, garnier@darwin.bu.edu


Abstract

Using a new database of 20 proteins not included in any of the previously used training datasets, we have incorporated multiple alignment information from homologous proteins into two well-characterized prediction methods: COMBINE (a jury method) and the Q-L (or quadratic-logistic) method. The increase in accuracy from the use of related proteins is similar for both methods (5.8% and 6.3%, respectively), yielding a per-residue prediction accuracy (Q3) of 68.7% and 69.0%, respectively, for a three-state prediction. Most of the improvement came from consideration of averaging, profiling or consensus predictions. Of this improvement, a small amount (0.5%) came from recognition that "gap-permissive" positions in the alignment are most frequently in the coil state. Our finding is consistent with the hypothesis of a common secondary structure for the aligned family, and with the view that the improved accuracy is due to reduced noise in the prediction.

Introduction

The present databases of amino acid sequences contain a relatively large number of homologous sequences, that is, sequences of proteins that are evolutionarily related and usually perform the same function. This also means that these homologous proteins share the same overall three-dimensional structure. When one of the members of such a protein family has a known structure, we may apply the principle of protein modeling by homology [1]. If no structure is known for that family, some structural data can nevertheless be extracted from the amino acid sequences, namely the secondary structure, described as α-helix (H), β-strand or extended structure (E), and non-periodic structure, or coil (C). When a single sequence is used, the accuracy of the best existing automated methods is 63-65% of correctly predicted residues in the three states H, E and C [see reviews in 2-3]. However, if the sequences of several members of the same family are known, it has been shown that secondary structure predictions can be significantly improved [4-6].
    These enhancements in accuracy rely on multiple alignments of the homologous sequences, which currently must have at least 25% identical residues [5], a value corresponding to the threshold used for sequences longer than 80 residues exhibiting similar secondary structures [7]. We do not imply that lower percentages of identity between sequences cannot also correspond to similar secondary and tertiary structures, but rather that the present alignment methods are not able to distinguish such proteins from others having a very different fold. Furthermore, the quality of alignments between such distant sequences degrades with evolutionary distance. The treatment of the multiple alignment of sequences differs among authors [4-6], but generally these methods account for the individual prediction of secondary structure for each aligned sequence and at each position in the alignment. This is done through a profile [6] or through a consensus prediction [4, 5].
    As the improvement brought to the prediction by the consensus prediction is very significant for the GOR or SIMPA algorithms [4, 5], we wanted to address the question of whether more advanced and accurate prediction algorithms, such as COMBINE [8] or the quadratic-logistic algorithm [9], can also benefit from the multiple alignment, and how this compares with the profile algorithm used by others and with data already obtained. For this we used a database of 20 proteins of high X-ray resolution, not bearing significant homology to any of the proteins used in developing the parameters for these different algorithms.
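The idea of a consensus prediction over aligned sequences, and of scoring it by the percentage of correctly predicted residues in three states, can be illustrated with a minimal sketch. It is not the procedure of COMBINE, Q-L, GOR or SIMPA: it assumes that each homologous sequence has already been predicted separately, that the predictions have been mapped onto the alignment columns as strings over H, E and C (with "-" at gaps), and that a simple majority vote with ties resolved to coil is acceptable. The function names `consensus_prediction` and `q3` and the tie-breaking rule are hypothetical choices made for this illustration.

```python
from collections import Counter

STATES = "HEC"  # helix, extended strand, coil


def consensus_prediction(aligned_predictions, tie_break="C"):
    """Majority vote over per-sequence predictions at each alignment column.

    aligned_predictions: list of equal-length strings over H/E/C, with any
    other character (e.g. '-') treated as "no vote" for that column.
    Ties and columns without votes fall back to `tie_break` (an assumption).
    """
    if not aligned_predictions:
        raise ValueError("need at least one aligned prediction")
    length = len(aligned_predictions[0])
    if any(len(p) != length for p in aligned_predictions):
        raise ValueError("all aligned predictions must have the same length")

    consensus = []
    for column in zip(*aligned_predictions):
        votes = Counter(s for s in column if s in STATES)
        if not votes:
            consensus.append(tie_break)
            continue
        top = max(votes.values())
        winners = [s for s in STATES if votes.get(s, 0) == top]
        consensus.append(winners[0] if len(winners) == 1 else tie_break)
    return "".join(consensus)


def q3(predicted, observed):
    """Percentage of residues whose predicted state matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    predictions = ["HHHEC", "HHCEC", "HCCE-"]  # toy data, not from the paper
    consensus = consensus_prediction(predictions)
    print(consensus, q3(consensus, "HHHEC"))
```

A profile-based treatment would instead average per-state scores or probabilities across the aligned sequences before choosing a state; the vote above is only the simplest form of combining individual predictions.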
    We also wished to investigate whether gaps in the alignment of homologous sequences could help to augment prediction accuracy. Others have shown that the presence of gaps in the alignment may be associated with coil propensity, and it is reasonable to assume that the exposed loop regions of proteins are more permissive of mutational insertions and deletions and that these loop regions commonly assume a random coil conformation [10].
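One way to picture the use of gaps is the following sketch, which flags alignment columns that contain gaps and assigns them to the coil state. The definition used here (a column is "gap-permissive" when at least a given fraction of the aligned sequences has a gap there) is an assumption made for illustration and is not taken from the method described in this paper; the names `gap_permissive_columns`, `apply_gap_rule` and `min_gap_fraction` are likewise hypothetical.

```python
def gap_permissive_columns(alignment, min_gap_fraction=0.0, gap_chars="-."):
    """Return one boolean per alignment column.

    A column is flagged when it contains at least one gap and the fraction
    of gapped sequences is >= min_gap_fraction (assumed definition).
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    flags = []
    for column in zip(*alignment):
        gaps = sum(ch in gap_chars for ch in column)
        flags.append(gaps > 0 and gaps / len(column) >= min_gap_fraction)
    return flags


def apply_gap_rule(prediction, flags):
    """Override the predicted state with coil (C) at flagged columns."""
    if len(prediction) != len(flags):
        raise ValueError("prediction and flags must have the same length")
    return "".join("C" if flagged else state
                   for state, flagged in zip(prediction, flags))


if __name__ == "__main__":
    alignment = ["MKV-LA", "MKVGLA", "MR--LA"]  # toy alignment
    flags = gap_permissive_columns(alignment)
    print(apply_gap_rule("HHHHEE", flags))
```

Raising `min_gap_fraction` restricts the rule to columns where gaps are common, which trades coverage of loop regions against the risk of overriding a correct helix or strand prediction.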

U.S. Government Work Not Protected by U.S. Copyright
Biotechnology Computing, Vol. 5
Eds: L. Hunter & B. D. Shriver
IEEE Computer Society Press, Los Alamitos, CA