Vichetra Sam*, Chin-Hsien Tai†, Jean Garnier*§, Jean-Francois Gibrat§, Byungkook Lee†, Peter J. Munson*
*Mathematical and Statistical Computing Laboratory, DCB, CIT, NIH, †Laboratory of Molecular Biology, CCR, NCI, NIH, DHHS, Bethesda, MD, USA and §Mathematique Informatique et Genome, INRA, Jouy-en-Josas, France
Protein structure comparison methods are useful for classifying protein domains. However, different methods give different results, and these differences can strongly affect our view of the protein structure universe.
To study how these differences arise, we compared VAST (Vector Alignment Search Tool), SHEBA (Structural Homology by Environment Based Alignment) and SCOP (Structural Classification of Proteins). SCOP is a manually curated protein domain database, often considered the gold standard of protein structure classification. VAST, which is used by NCBI/Entrez (National Center for Biotechnology Information), aligns structures by clustering vectors that represent secondary structure elements. SHEBA uses sequence homology and the environment profile to obtain an initial alignment.
We conducted an extensive statistical analysis of SHEBA and VAST on 4,676 SCOP domains having less than 40% pairwise sequence identity, comprising 468 SCOP folds. The area under the receiver operating characteristic (ROC) curve for detection of the SCOP fold is 0.93 for SHEBA and 0.90 for VAST. At a 1% false positive rate, SHEBA yields a true positive rate (sensitivity) of 75% and VAST yields 62%.
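As an illustration of the two statistics quoted above, the following minimal sketch computes an ROC curve, its area, and the sensitivity at a fixed false positive rate from a list of pairwise similarity scores. It is not the authors' code: the function names (`roc_curve`, `auc`, `tpr_at_fpr`), the toy scores, and the convention that a label of 1 means "the two domains share a SCOP fold" are all assumptions made for this example.

```python
def roc_curve(scores, labels):
    """Return (fpr, tpr) lists obtained by sweeping a threshold from high to low score.

    scores: similarity score for each domain pair (higher = more similar).
    labels: 1 if the pair shares a SCOP fold, 0 otherwise (assumed convention).
    """
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Pairs with tied scores cross the threshold together.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
        for k in range(1, len(fpr))
    )


def tpr_at_fpr(fpr, tpr, max_fpr):
    """Highest true positive rate reached without exceeding max_fpr."""
    return max(t for f, t in zip(fpr, tpr) if f <= max_fpr)


# Toy data (hypothetical, not from the study).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_fpr, toy_tpr = roc_curve(toy_scores, toy_labels)
toy_auc = auc(toy_fpr, toy_tpr)
toy_sensitivity = tpr_at_fpr(toy_fpr, toy_tpr, 0.01)
```

With real data, `scores` would hold one SHEBA or VAST score per domain pair, and `tpr_at_fpr(fpr, tpr, 0.01)` would correspond to the sensitivity at a 1% false positive rate.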
Representing the protein fold space as a confusion matrix allows us to identify the differences between VAST and SHEBA, and to quantify and interpret at the molecular level the agreement and divergence between the automatic methods and SCOP. Results are examined in relation to the algorithms used by VAST and SHEBA. The observed variation and ambiguity of some common cores defining SCOP folds, together with the avalanche of new protein structures from ongoing genomics projects, underline the need for new procedures to classify protein structures.
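One plausible way to build a fold-level confusion matrix is sketched below. It assumes, purely for illustration, that each query domain is assigned the SCOP fold of its best-scoring neighbour under a given method; the abstract does not state how the matrix was actually constructed. The function name `fold_confusion` and the toy fold labels are hypothetical.

```python
from collections import Counter


def fold_confusion(true_folds, predicted_folds):
    """Count how often each true SCOP fold is assigned to each predicted fold.

    true_folds: SCOP fold label of each query domain.
    predicted_folds: fold label assigned by the automatic method
        (assumed here to be the fold of the best-scoring neighbour).
    Returns the sorted fold labels and a square matrix where
    matrix[i][j] counts domains of fold i assigned to fold j.
    """
    counts = Counter(zip(true_folds, predicted_folds))
    folds = sorted(set(true_folds) | set(predicted_folds))
    matrix = [[counts[(t, p)] for p in folds] for t in folds]
    return folds, matrix


# Toy labels (hypothetical, not taken from the study).
toy_true = ["a.1", "a.1", "b.1", "c.1"]
toy_pred = ["a.1", "b.1", "b.1", "c.1"]
toy_folds, toy_matrix = fold_confusion(toy_true, toy_pred)
```

Diagonal entries count domains for which the method agrees with SCOP, and off-diagonal entries show which folds the method tends to confuse.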
More detailed heatmaps, showing the organization of the protein fold space within each class as computed by VAST and SHEBA, can be downloaded here (in PDF format) or generated using the MATLAB code.