SEMIPARAMETRIC AND KERNEL DENSITY ESTIMATION PROCEDURES FOR PREDICTION OF PROTEIN SECONDARY STRUCTURE

Peter J. Munson, Lihong Cao, Valentina Di Francesco and Raul Porrelli, National Institutes of Health

Peter J. Munson, NIH, Bldg 12A, Rm 2041, Bethesda, Md 20892


KEY WORDS: Protein Folding, Quadratic Logistic Model, Nonparametric Density Estimation, Crossvalidation


We have investigated the problem of predicting protein secondary structure from amino acid sequence information, using parametric, semi-parametric and nonparametric approaches. In a three-state model of secondary structure (helix, sheet, coil), which describes the local conformation of the protein polymer chain, we are able to attain an accuracy rate between 63 and 67% using each of these approaches. Maximum likelihood estimates are more satisfactory than the "information theory" method used by previous authors. In the fully parametric approach, parameter values have meaningful biophysical interpretations. The nonparametric approach is a variation of "homology" prediction methods familiar to molecular biologists. The semi-parametric approach produces a compromise result, providing some parametric information together with protection from model mis-specification. Computation-intensive crossvalidation is necessary to establish the correct prediction rates.


Introduction

Protein secondary structure prediction is a component of the "grand challenge" problem of protein folding. How a protein folds depends in a very complex way upon its primary amino acid sequence. For a few hundred proteins, however, the full three-dimensional structure is known. Using this database, we attempt to build a predictive "model" for the folding process which can determine the location and type (alpha helix, beta strand) of a protein's secondary structure. Three statistical techniques are explored: parametric, nonparametric and semi-parametric. For the parametric model, we use the three-state logistic model, which can be made quite general and for which standard algorithms exist. Logistic models include several important structure prediction methods (e.g., the GOR "information theory" method and single-layer artificial neural nets) as special cases. A penalized version of the logistic model is in fact a semi-parametric model, in that it bridges the parametric and nonparametric settings. Kernel density estimation-discriminant function analysis serves as the nonparametric model, after mapping the problem to a suitable metric space.


The logistic models produce a set of probability estimates or "relative tendencies" for a residue to be in a particular structural state given the surrounding protein residue sequence. In our model, state probabilities are described by logistic functions of linear and quadratic functions of the independent variables. The full quadratic-logistic model is overly general and has more parameters than can be estimated from available data sets, so the model must be restricted in some way. We exploit the periodicity of alpha helix and beta strand secondary structures to obtain an important special case of the quadratic model that preserves the interpretability of the parameters.
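
As an illustration only, the following Python sketch shows how a three-state logistic model with linear and quadratic terms could turn a feature vector for a local sequence window into state probabilities. The function name, the feature encoding, the coefficient layout and the choice of coil as the reference state are all assumptions made for this example; they are not taken from the fitted model described here.

```python
import math

STATES = ("helix", "sheet", "coil")  # three-state model of secondary structure

def state_probabilities(x, linear, quadratic):
    """Return {state: probability} for one residue.

    x         : list of numeric features describing the local sequence window
                (hypothetical encoding, e.g. indicator variables)
    linear    : dict state -> list of linear coefficients, one per feature
    quadratic : dict state -> {(i, j): coefficient} for products x[i] * x[j]
    """
    scores = {}
    for s in STATES:
        score = sum(b * xi for b, xi in zip(linear[s], x))
        score += sum(c * x[i] * x[j] for (i, j), c in quadratic[s].items())
        scores[s] = score
    # Logistic (softmax) transform; subtracting the maximum avoids overflow.
    top = max(scores.values())
    expd = {s: math.exp(v - top) for s, v in scores.items()}
    total = sum(expd.values())
    return {s: e / total for s, e in expd.items()}

# Toy usage with made-up coefficients; coil is the reference state (all zeros).
x = [1.0, 0.0, 1.0]
linear = {"helix": [0.5, 0.0, 0.2], "sheet": [-0.3, 0.1, 0.0], "coil": [0.0, 0.0, 0.0]}
quadratic = {"helix": {(0, 2): 0.4}, "sheet": {}, "coil": {}}
probs = state_probabilities(x, linear, quadratic)
```

Storing the quadratic coefficients sparsely, keyed by feature pairs, reflects the point made above: the full set of pairwise terms is too large to estimate, so only a restricted subset would be carried in practice.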


The penalized logistic model adds further restrictions on the poorly determined quadratic logistic model parameters, and makes possible optimal prediction compared with any of the parametric models tested. The penalized model produces estimates of some of the biophysically interesting "contact propensities" for pairs of amino acid residues.
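
The form of the penalty is not specified at this point in the text. Purely as a sketch, and assuming a simple squared (ridge-type) penalty on the quadratic coefficients, a penalized objective could be written in Python as follows. The function name and the penalty weight `lam` are hypothetical.

```python
import math

def penalized_neg_log_likelihood(probs, labels, quad_coefs, lam):
    """Negative log-likelihood plus an assumed ridge penalty.

    probs      : list of {state: probability} dicts, one per residue
    labels     : list of observed states, aligned with probs
    quad_coefs : iterable of quadratic-term coefficients to be shrunk
    lam        : penalty weight (hypothetical tuning parameter, lam >= 0)
    """
    nll = -sum(math.log(p[y]) for p, y in zip(probs, labels))
    penalty = lam * sum(c * c for c in quad_coefs)
    return nll + penalty
```

With `lam = 0` this reduces to the ordinary maximum likelihood criterion; increasing `lam` shrinks the poorly determined quadratic coefficients toward zero, which is one way such a model can move between a restricted parametric fit and a more flexible one.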


The nonparametric method consists of two steps. First, the local sequence of each residue is mapped into a multidimensional metric space. To limit its dimensionality, a multidimensional scaling technique is applied to a recently published matrix of residue similarity scores (PAM matrix). Second, kernel density estimation is applied to reconstruct the "sequence" density of residues in the alpha, beta, and coil states. This nonparametric density forms the basis for the discriminant function.
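
The second step can be sketched in Python as below, assuming the first step has already placed each local sequence at a point in a low-dimensional space. The isotropic Gaussian kernel, the single shared bandwidth, the prior weights proportional to class size, and the function names are assumptions made for illustration, not details given in the text.

```python
import math

def gaussian_kde(point, sample, bandwidth):
    """Kernel density estimate at `point`: the average of isotropic
    Gaussian kernels centred on each point in `sample`."""
    d = len(point)
    norm = (2.0 * math.pi * bandwidth ** 2) ** (d / 2.0)
    total = 0.0
    for s in sample:
        sq_dist = sum((a - b) ** 2 for a, b in zip(point, s))
        total += math.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return total / (len(sample) * norm)

def classify(point, training, bandwidth=1.0):
    """Assign `point` to the state with the largest prior-weighted density.

    training : dict state -> list of embedded points (tuples of floats)
    """
    n = sum(len(pts) for pts in training.values())
    scores = {
        state: (len(pts) / n) * gaussian_kde(point, pts, bandwidth)
        for state, pts in training.items()
    }
    return max(scores, key=scores.get)

# Toy usage with made-up two-dimensional embeddings.
training = {
    "helix": [(0.0, 0.0), (0.1, 0.2)],
    "sheet": [(3.0, 3.0), (3.2, 2.9)],
    "coil": [(-3.0, 3.0), (-2.8, 3.1)],
}
predicted = classify((0.05, 0.1), training)
```

The bandwidth governs how far the influence of each database sequence extends; in practice it would need to be chosen by a procedure such as crossvalidation.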


Background

Since the early 1970's, biochemists have recognized that the secondary structural state of protein residues could be predicted, at least partially, from the local amino acid sequence. Chou and Fasman (Chou and Fasman, 1978; Chou and Fasman, 1978) first computed "conformational parameters" or preference indices of various residues for being in each structural state. Garnier and others (Garnier, et al., 1978) presented a method, now termed the "GOR" method, for combining information for several residues in a local sequence, and were able to obtain a prediction rate of 57%. Later this group boosted the claimed prediction rate to
 
Proceedings of the Stat. Computing Section
American Statistical Association
San Francisco, California, August 8-12, 199