OF PROTEIN SECONDARY STRUCTURE
Peter J. Munson, Lihong Cao, Valentina Di Francesco and Raul Porrelli, National Institutes of Health
Peter J. Munson, NIH, Bldg 12A, Rm 2041, Bethesda, Md 20892
KEY WORDS: Protein Folding, Quadratic Logistic Model, Nonparametric Density Estimation, Crossvalidation
We have investigated the prediction of protein secondary structure from amino acid sequence information, using parametric, semi-parametric and nonparametric approaches. In a three-state model of secondary structure (helix, sheet, coil), which describes the local conformation of the protein polymer chain, we attain an accuracy rate between 63 and 67% with each of these approaches. Maximum likelihood estimates are more satisfactory than the "information theory" method used by previous authors. In the fully parametric approach, parameter values have meaningful biophysical interpretations. The nonparametric approach is a variation of "homology" prediction methods familiar to molecular biologists. The semi-parametric approach produces a compromise result, providing some parametric information together with protection from model mis-specification. Computation-intensive crossvalidation is necessary to establish the correct prediction rates.
Protein secondary structure prediction is a component of the "grand challenge" problem of protein folding. How a protein folds depends in a very complex way upon its primary amino acid sequence. For a few hundred proteins, however, the full three-dimensional structure is known. Using this database, we attempt to build a predictive "model" for the folding process which can determine the location and type (alpha helix, beta strand) of a protein's secondary structure. Three statistical techniques are explored: parametric, nonparametric and semi-parametric. For the parametric model, we use the three-state logistic model, which can be made quite general and for which standard algorithms exist. Logistic models include several important structure prediction methods (e.g., the GOR "information theory" method and single-layer artificial neural nets) as special cases. A penalized version of the logistic model is in fact a semi-parametric model, in that it bridges the parametric and nonparametric settings. Kernel density estimation with discriminant function analysis serves as the nonparametric model, after mapping the problem to a suitable metric space.
The logistic models produce a set of probability estimates or "relative tendencies" for a residue to be in a particular structural state given the surrounding protein residue sequence. In our model, state probabilities are described by logistic functions of linear and quadratic functions of the independent variables. The full quadratic-logistic model is overly general and has more parameters than can be estimated from available data sets, so the model must be restricted in some way. We exploit the periodicity of alpha helix and beta strand secondary structures to obtain an important special case of the quadratic model that preserves the interpretability of the parameters.
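As an illustration of the functional form described above, the following is a minimal sketch of a three-state logistic model with linear and quadratic terms. It is not the implementation used in this work: the one-hot window encoding, the function names, the parameter shapes and the unrestricted quadratic term Q are assumptions made for the example, and the periodicity-based restriction is not shown.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues
STATES = ("helix", "sheet", "coil")

def one_hot_window(window):
    """Encode a sequence window as a flat 0/1 vector (positions x 20 residues).

    Assumed encoding, chosen for illustration only.
    """
    x = np.zeros((len(window), len(AMINO_ACIDS)))
    for pos, aa in enumerate(window):
        x[pos, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

def state_probabilities(x, b, W, Q=None):
    """Three-state logistic probabilities for one window.

    score_k = b[k] + W[k] . x + x' Q[k] x, followed by normalisation.
    b: (3,), W: (3, d), Q: (3, d, d) or None for the purely linear model.
    """
    scores = b + W @ x
    if Q is not None:
        scores = scores + np.einsum("i,kij,j->k", x, Q, x)
    scores = scores - scores.max()      # guard against overflow in exp
    p = np.exp(scores)
    return p / p.sum()

# Usage with arbitrary (random, untrained) parameters on a 5-residue window.
rng = np.random.default_rng(0)
x = one_hot_window("ALKEG")
d = x.size
p = state_probabilities(x, rng.normal(size=3), rng.normal(size=(3, d)),
                        0.1 * rng.normal(size=(3, d, d)))
print(dict(zip(STATES, p)))
```

Even for this short window the unrestricted Q holds 3 x 100 x 100 entries, which illustrates why the full quadratic model must be restricted before it can be estimated.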
The penalized logistic model adds further restrictions on the poorly determined quadratic logistic model parameters, and makes possible better prediction than any of the parametric models tested. The penalized model produces estimates of some of the biophysically interesting "contact propensities" for pairs of amino-acid residues.
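The idea of penalizing the poorly determined quadratic parameters can be sketched as a penalized negative log-likelihood. The form of the penalty is not specified here, so the squared-norm (ridge-type) penalty on Q, the weight lam and the function name below are assumptions made for illustration only.

```python
import numpy as np

def penalized_nll(b, W, Q, X, y, lam):
    """Negative log-likelihood of a three-state quadratic logistic model
    plus an assumed ridge-type penalty on the quadratic parameters.

    X: (n, d) encoded windows; y: (n,) state labels in {0, 1, 2};
    b: (3,), W: (3, d), Q: (3, d, d); lam: penalty weight (>= 0).
    """
    # Linear and quadratic score for every window and state.
    scores = b + X @ W.T + np.einsum("ni,kij,nj->nk", X, Q, X)
    # Log-softmax computed stably.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_p = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    nll = -log_p[np.arange(len(y)), y].sum()
    # Only the quadratic terms are shrunk toward zero.
    return nll + lam * np.sum(Q ** 2)
```

With lam = 0 this reduces to the ordinary maximum likelihood criterion; as lam grows, the quadratic terms are shrunk toward zero and the fit approaches the linear logistic model. In practice lam would have to be chosen from the data, for instance by crossvalidation.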
The nonparametric method consists of two steps. First, the local sequence of each residue is mapped into a multidimensional metric space. To limit its dimensionality, a multidimensional scaling technique is applied to a recently published matrix of residue similarity scores (PAM matrix). Second, kernel density estimation is applied to reconstruct the "sequence" density of residues in the alpha, beta, and coil states. This nonparametric density forms the basis for the discriminant function.
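The two steps can be sketched as follows. This is a sketch under stated assumptions rather than the procedure used in this work: classical (Torgerson) scaling stands in for the unspecified multidimensional scaling technique, the dissimilarity matrix D is assumed to have already been derived from the similarity scores, a window is embedded by concatenating per-residue coordinates, and an isotropic Gaussian kernel with a single bandwidth is used. All function names are hypothetical.

```python
import numpy as np

def classical_mds(D, n_dims):
    """Classical (Torgerson) scaling: coordinates whose Euclidean
    distances approximate the dissimilarity matrix D (n x n)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:n_dims]      # largest eigenvalues first
    return vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))

def embed_window(window, coords, alphabet):
    """Map a sequence window to one point by concatenating the
    low-dimensional coordinates of its residues (assumed mapping)."""
    return np.concatenate([coords[alphabet.index(aa)] for aa in window])

def kernel_density(point, samples, bandwidth):
    """Gaussian kernel density estimate at `point` from `samples` (m x d)."""
    d = samples.shape[1]
    sq_dist = np.sum((samples - point) ** 2, axis=1)
    norm = (2.0 * np.pi * bandwidth ** 2) ** (d / 2.0)
    return np.mean(np.exp(-0.5 * sq_dist / bandwidth ** 2)) / norm

def classify(point, samples_by_state, bandwidth):
    """Discriminant rule: pick the state maximising prior x density."""
    total = sum(len(s) for s in samples_by_state.values())
    scores = {state: len(s) / total * kernel_density(point, s, bandwidth)
              for state, s in samples_by_state.items()}
    return max(scores, key=scores.get)
```

The prior for each state is taken here as its relative frequency in the training sample. The bandwidth and the number of scaling dimensions are tuning choices that would need to be set from the data.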