Secondary Structure Prediction using Penalized Likelihood Models
Peter J. Munson, Valentina Di Francesco, Raul Porrelli
Analytical Biostatistics Section
Laboratory of Structural Biology, Division of Computer Research and Technology,
National Institutes of Health, Bethesda, Md 20892
Protein secondary structure (the structural state of individual amino acids in the protein chain) can be predicted from knowledge of the primary sequence. Fully accurate prediction would solve the long-standing and difficult protein folding problem. Since protein sequence is easily determined while structure determination can take years, the problem has immediate and wide relevance. Many investigators have approached the secondary structure problem with a variety of techniques, ranging from simple heuristics to sophisticated neural networks and associative memory Hamiltonians. Such methods have achieved only limited prediction accuracy.
We have applied parametric and semi-parametric statistical modeling techniques in an attempt to improve earlier results. Cross-validation, a computationally intensive task, is used to determine the appropriate level of model complexity. In a three-state model of secondary structure (helix, sheet, coil) describing the local conformation of the protein polymer chain, we are able to attain an accuracy rate between 63% and 66% using this approach. Maximum likelihood estimates are more satisfactory than the "information theory" method used by previous authors. The advantage of the fully parametric approach is that model parameter values have meaningful biophysical interpretations. The semi-parametric approach provides some parametric information together with protection from model misspecification, and allows estimation to proceed in the face of an overparameterized system.
Introduction
Since the early 1970s, biochemists have recognized that the secondary structural state of protein residues could be predicted, at least partially, from the local amino acid sequence. Chou and Fasman (Chou and Fasman, 1978, Chou and Fasman, 1978) first computed "conformational parameters", or preference indices, of various residues for being in each structural state. Garnier and others (Garnier, et al., 1978) presented a method, now termed the "GOR" method, for combining information from several residues in a local sequence, and were able to obtain a prediction rate of 57%. Later this group boosted the claimed prediction rate to about 63%, using a larger data set together with a consideration of the effect of certain pairs of residues (Gibrat, et al., 1987). This rate appears to be close to the upper limit of prediction accuracy, even in the face of at least seven independent attempts (Holley and Karplus, 1989, King and Sternberg, 1990, Kneller, et al., 1990, Qian and Sejnowski, 1989, Salzberg and Cost, 1992, Stolorz, et al., 1992, Zhang, et al., 1992) to improve it using artificial intelligence techniques such as neural networks, machine learning, nearest-neighbor and combined approaches. However, these investigators have by no means tested every possible model for protein structure prediction, nor have they been concerned with optimal prediction methodology (excepting Stolorz). Indeed, the widely accepted Anfinsen hypothesis (Anfinsen, et al., 1961) holds that sequence uniquely determines protein structure, suggesting that a "perfect" model for structure prediction must exist. As the Brookhaven Data Bank of protein structures grows, it is reasonable to think that the complete rule (or model) for structure prediction will eventually become discernible, if only one is clever enough to recognize it.
We have taken a statistical approach, viewing the situation as a regression problem in which both the number of variables in the regression and the form of the regression equation itself are unknown. Nonparametric methods are ideally suited to this situation. Smoothing spline curve fitting (Eubank, 1988) and LOESS (Cleveland and Devlin, 1988) are popular examples of such nonparametric function estimation methods for functions of a single variable. Such methods have been shown to converge to the "true" function as the amount of data grows, under suitable conditions.
Our approach uses a hybrid of a strictly parametric model (linear and quadratic logistic) with nonparametric (or "model-free") techniques. Such hybrid approaches have been extremely useful in a variety of applications (e.g., Guardabasso, et al., 1988). This approach allows us to build parameters of known importance into the model while still allowing a very general, perhaps unexpected, model structure to emerge from the data.
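As an illustration of the parametric part only, the following minimal sketch shows one way a linear logistic model over a local sequence window could be set up for the three states (helix, sheet, coil). The function names `encode_window` and `state_probabilities`, the one-hot encoding, and the window half-width are assumptions made for this example and are not taken from the text; the quadratic (residue-pair) terms are omitted.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode_window(sequence, center, half_width=2):
    """One-hot encode the residues at offsets -half_width..+half_width.

    Window positions that fall off either end of the chain stay all-zero.
    (Hypothetical helper; the half-width is an illustrative choice.)
    """
    width = 2 * half_width + 1
    x = np.zeros((width, len(AMINO_ACIDS)))
    for k, pos in enumerate(range(center - half_width, center + half_width + 1)):
        if 0 <= pos < len(sequence):
            x[k, AA_INDEX[sequence[pos]]] = 1.0
    return x.ravel()

def state_probabilities(x, W, b):
    """Linear multinomial logistic model: P(state | window) for 3 states."""
    z = W @ x + b
    z = z - z.max()          # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Usage: with all-zero parameters, every state is equally likely.
x = encode_window("ACDEFG", center=0, half_width=2)
probs = state_probabilities(x, np.zeros((3, x.size)), np.zeros(3))
print(probs)
```

With all parameters set to zero the model assigns probability 1/3 to each state; fitting the entries of `W` and `b` to known structures is what gives individual parameters an interpretation as position-specific residue preferences.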
Nonparametric methods generally require a trade-off between bias (due to fitting the wrong function to the data) and variance (due to inefficient use of the data at hand when the model has too many parameters). In our case, this trade-off may be controlled by setting a penalty factor. The trade-off may also be thought of in terms of choosing how much weight to give prior beliefs about the system versus how much weight to give the current data. This trade-off must be made regardless of which technique is applied to the data, be it statistical or artificial intelligence based.
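The role of the penalty factor can be sketched with a small example. The code below is an illustrative assumption, not the authors' implementation: it fits a two-state logistic model to synthetic data by maximizing the log-likelihood minus a quadratic (ridge) penalty `lam / 2 * ||w||^2`, and uses k-fold cross-validation to compare penalty values by held-out log-likelihood. The specific penalty form, the Newton solver, and all names are hypothetical choices made for the example.

```python
import numpy as np

def fit_penalized_logistic(X, y, lam, n_iter=50):
    """Maximize log-likelihood - (lam / 2) * ||w||^2 with Newton steps."""
    n_features = X.shape[1]
    w = np.zeros(n_features)
    for _ in range(n_iter):
        z = np.clip(X @ w, -30.0, 30.0)          # guard against overflow
        prob = 1.0 / (1.0 + np.exp(-z))
        grad = X.T @ (y - prob) - lam * w        # gradient of the penalized objective
        hess = (X * (prob * (1.0 - prob))[:, None]).T @ X + lam * np.eye(n_features)
        w = w + np.linalg.solve(hess, grad)      # Newton update
    return w

def held_out_log_likelihood(X, y, lam, n_folds=5):
    """Average per-observation log-likelihood on held-out folds."""
    indices = np.arange(len(y))
    total = 0.0
    for held_out in np.array_split(indices, n_folds):
        train = np.setdiff1d(indices, held_out)
        w = fit_penalized_logistic(X[train], y[train], lam)
        z = np.clip(X[held_out] @ w, -30.0, 30.0)
        total += np.sum(y[held_out] * z - np.log1p(np.exp(z)))
    return total / len(y)

# Synthetic data (an assumption for illustration): 20 predictors, only 3 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
true_w = np.zeros(20)
true_w[:3] = [1.5, -1.0, 0.5]
y = (rng.random(400) < 1.0 / (1.0 + np.exp(-X @ true_w))).astype(float)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
scores = [held_out_log_likelihood(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmax(scores))]
print("cross-validated penalty:", best_lam)
```

A small penalty lets the data dominate (low bias, high variance), while a large penalty shrinks the parameters toward zero, which corresponds to weighting the prior belief more heavily. Cross-validation picks the value that predicts held-out data best.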
We utilize the maximum likelihood (ML) principle in calibrating our prediction scheme. This principle states that
Proceedings of the 25th Symposium on the Interface
Computing Science and Statistics, Vol. 25
San Diego, California, April 14-17, 1993