Protein Secondary Structure Prediction using Periodic-Quadratic-Logistic
Models: Statistical and Theoretical Issues

Peter J. Munson, Valentina Di Francesco, Raul Porrelli

Analytical Biostatistics Section, Laboratory of Structural Biology, Division of
Computer Research and Technology, National Institutes of Health, Bethesda, MD 20892

Abstract

We extend logistic discriminant function methodology to compete effectively with neural networks and "information theory" methods in the prediction of protein secondary structure. Unlike "black-box" methods, our model produces 400 pairwise interaction parameters which are interpretable from a molecular standpoint. Under optimal conditions, our model can produce up to 65.9% cross-validated prediction accuracy on three states. A broad family of models is searched using a semi-parametric (penalized) approach combined with stepwise parameter selection. We show that optimal models have about 800 effective parameters for this data set. The highest prediction accuracy is concentrated in a fraction of the total residues, and the confidence of a prediction can be easily calculated. Such high-confidence predictions may be useful as the basis for prediction of the complete structure of the protein.



Introduction

The protein secondary structure prediction problem has become a classic, challenging problem for the artificial intelligence and machine learning community. Virtually every conceivable computational technique in these fields (e.g., information theory [6, 12, 13], artificial neural networks [15, 20, 22], cascaded networks [18, 19, 27], hybrid systems [28], nearest neighbor methods [21], hidden Markov chains [4], machine learning [17, 25], mutual information [26]) has been applied in the context of protein structure prediction. The reason for this attention is clear and well founded: if protein structure, even secondary structure, can be accurately predicted from the now abundantly available gene and protein sequences, such sequences become immensely more valuable for the understanding of drug design, the genetic basis of disease, the role of protein structure in its enzymatic, structural, and signal transduction functions, and basic physiology from the molecular to the cellular to the fully systemic level. In short, the solution of the protein structure prediction problem (and the related protein folding problem) will bring on the second phase of the molecular biology revolution.
    While the computational science community has become extensively concerned with the secondary structure prediction problem, it seems that many classical but powerful statistical methods have been ignored. We have generalized a well-known procedure for statistical discriminant analysis in such a way as to attain a prediction rate comparable to or exceeding that of machine-learning techniques. While there is almost certainly a correct, though possibly complex, rule for predicting structure from sequence [3], artificial intelligence and statistical methods currently seem to be prevented from discovering it owing to the limited available data. This problem will likely be solved only as additional a priori knowledge and information about the physics of proteins is built into the prediction machinery.
    We have taken a purely statistical approach, viewing the situation as a regression or classification problem with an unknown number of variables, and further, an unknown form for the regression equation itself. Our approach is a hybrid of a strictly parametric model (linear and quadratic logistic) with nonparametric (or "model-free") techniques. Such hybrid or semiparametric approaches have been extremely useful in other applications (e.g., [14]). This approach allows incorporation of variables of known importance while still allowing a very general, perhaps unexpected model structure to emerge from the data.
    In artificial neural networks, model complexity is attained by adding nodes or hidden layers to the network. In our approach, complexity is controlled either by adding or removing terms of the regression model, or by adjusting the "smoothing parameter", which effectively allows continuous adjustment of the number of parameters in the model. In this way, we are able to find a near-optimal model within a very large class of complex models. The technique is exactly analogous to adjusting the "smoothness" of a spline drawn to represent noisy data points in one or two dimensions.
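    The way a smoothing parameter continuously adjusts the number of parameters can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from this paper: the data are synthetic, the model is a two-class logistic regression rather than the three-state model, the penalty is a simple ridge (quadratic) penalty, and the effective number of parameters is computed with the common trace definition, trace[(X'WX + lam*I)^-1 X'WX]. The function name fit_penalized_logistic and the variable names lams and edfs are hypothetical.

```python
import numpy as np

def fit_penalized_logistic(X, y, lam, n_iter=50, tol=1e-8):
    """Ridge-penalized logistic regression fitted by Newton's method.

    Returns the coefficients and the effective number of parameters,
    trace[(X'WX + lam*I)^-1 X'WX], evaluated at the fitted coefficients.
    The ridge penalty is an illustrative assumption.
    """
    n, p = X.shape
    beta = np.zeros(p)
    penalty = lam * np.eye(p)
    for _ in range(n_iter):
        eta = np.clip(X @ beta, -30.0, 30.0)   # guard against overflow in exp
        mu = 1.0 / (1.0 + np.exp(-eta))        # fitted probabilities
        w = mu * (1.0 - mu)                    # logistic variance weights
        grad = X.T @ (y - mu) - lam * beta     # gradient of penalized log-likelihood
        hess = X.T @ (X * w[:, None]) + penalty
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Recompute the weights at the final coefficients.
    eta = np.clip(X @ beta, -30.0, 30.0)
    mu = 1.0 / (1.0 + np.exp(-eta))
    w = mu * (1.0 - mu)
    xtwx = X.T @ (X * w[:, None])
    edf = float(np.trace(np.linalg.solve(xtwx + penalty, xtwx)))
    return beta, edf

# Synthetic data (assumption): 400 observations, 20 nominal parameters,
# only 5 of which carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
true_beta = np.zeros(20)
true_beta[:5] = 0.8
prob = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
y = (rng.random(400) < prob).astype(float)

# Increasing the smoothing parameter shrinks the effective parameter count
# smoothly from nearly 20 toward 0, without adding or removing any term.
lams = [0.01, 1.0, 100.0, 10000.0]
edfs = [fit_penalized_logistic(X, y, lam)[1] for lam in lams]
for lam, edf in zip(lams, edfs):
    print(f"lambda = {lam:>8g}   effective parameters = {edf:.2f}")
```

    In this sketch the nominal number of parameters stays fixed at 20 while the effective number varies continuously with the smoothing parameter, which is the sense in which a penalized model can have a non-integer or intermediate parameter count.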

1060-3425/94 $3.00 © 1994 IEEE. Proceedings of the Twenty-Seventh Annual Hawaii
International Conference on System Sciences, 1994, p. 375