Structural Classification of Proteins

Introduction to Structural Classification of Proteins

Alexey G. Murzin, Steven E. Brenner, Tim J. P. Hubbard, and Cyrus Chothia
MRC Laboratory of Molecular Biology and Centre for Protein Engineering
Hills Road, Cambridge CB2 2QH, England

scop@mrc-lmb.cam.ac.uk

Introduction

Nearly all proteins have structural similarities with other proteins and, in some of these cases, share a common evolutionary origin. A knowledge of these relationships is crucial to our understanding of the evolution of proteins and of development. It will also play an important role in the analysis of the sequence data that is being produced by worldwide genome projects.

The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known, including all entries in the Protein Data Bank (PDB). It is available as a set of tightly linked hypertext documents which make the large database comprehensible and accessible. In addition, the hypertext pages offer a panoply of representations of proteins, including links to PDB entries, sequences, references, images and interactive display systems. The entry point to the database on the World Wide Web is http://scop.mrc-lmb.cam.ac.uk/scop/ (MRC site).

Existing automatic sequence and structure comparison tools cannot identify all structural and evolutionary relationships between proteins. The SCOP classification of proteins has been constructed manually by visual inspection and comparison of structures, but with the assistance of tools to make the task manageable and help provide generality. The job is made more challenging--and theoretically daunting--by the fact that the entities being organized are not homogeneous: sometimes it makes more sense to organize by individual domains, and other times by whole multi-domain proteins.

Classification

Proteins are classified to reflect both structural and evolutionary relatedness. Many levels exist in the hierarchy, but the principal levels are family, superfamily and fold, described below. The exact position of the boundaries between these levels is to some degree subjective. Our evolutionary classification is generally conservative: where any doubt about relatedness exists, we have made new divisions at the family and superfamily levels. Thus, some researchers may prefer to focus on the higher levels of the classification tree, where proteins with structural similarity are clustered.

The different major levels in the hierarchy are:

Family: Clear evolutionary relationship
Proteins clustered together into families are clearly evolutionarily related. Generally, this means that pairwise residue identities between the proteins are 30% or greater. However, in some cases similar functions and structures provide definitive evidence of common descent in the absence of high sequence identity; for example, many globins form a family though some members have sequence identities of only 15%.
Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.
Fold: Major structural similarity
Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. Different proteins with the same fold often have peripheral elements of secondary structure and turn regions that differ in size and conformation. In some cases, these differing peripheral regions may comprise half the structure. Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.
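
The three principal levels nest: a family sits inside a superfamily, which sits inside a fold. As a minimal sketch of how that nesting might be represented and queried, the Python below stores each protein's position as a path through the hierarchy and reports the most specific level two proteins share. The class name, function name and all fold, superfamily and family labels are hypothetical placeholders chosen for illustration; they are not SCOP identifiers, and the assumption that actin and hexokinase fall in different families is made only for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Classification:
    """Position of one protein in the hierarchy (all labels are placeholders)."""
    name: str
    fold: str
    superfamily: str
    family: str

    def path(self):
        # Ordered from the most general level to the most specific one.
        return (self.fold, self.superfamily, self.family)


def closest_shared_level(a: Classification, b: Classification) -> Optional[str]:
    """Return the most specific level at which a and b are grouped together,
    or None if they do not even share a fold."""
    shared = 0
    for label_a, label_b in zip(a.path(), b.path()):
        if label_a != label_b:
            break
        shared += 1
    # 0 shared labels: nothing in common; 1: fold; 2: superfamily; 3: family.
    return (None, "fold", "superfamily", "family")[shared]


# Illustrative entries only. The text states that actin and hexokinase belong
# to one superfamily; the family split and every label string are assumptions.
actin = Classification("actin", "fold-1", "superfamily-1", "family-A")
hexokinase = Classification("hexokinase", "fold-1", "superfamily-1", "family-B")
globin = Classification("a globin", "fold-2", "superfamily-2", "family-C")

print(closest_shared_level(actin, hexokinase))
print(closest_shared_level(actin, globin))
```

Comparing whole paths rather than single labels reflects the point made above: a shared fold says nothing by itself about common descent, so evolutionary relatedness is only implied when the match extends down to the superfamily or family level.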

Usage

We hope that SCOP will have broad utility that will attract a wide range of users. Experimental structural biologists may wish to explore the region of "structure space" near their proteins of current research, while theoreticians will likely find it most useful to browse the wide range of protein folds currently known. Molecular biologists may find the classification helpful because the categorization assists in locating proteins of interest and the links make exploration easy. We also hope that SCOP will find pedagogical use, for it organizes structures in an easily comprehensible manner and makes them accessible from even a simple personal computer.

Acknowledgements

The SCOP authors thank Sean Eddy, Graeme Mitchison, Erik Sonnhammer, and others in the computational molecular biology discussion group for useful suggestions. Thanks to Roger Sayle (author of rasmol) for suggesting the tcl/tk interface to rasmol. The University of Cambridge School of Biological Sciences is thanked for providing the principal database access point in the initial years. The mirror sites are thanked for providing local access to SCOP. AGM and CC are grateful to MRC for support. SEB is grateful to Herchel Smith and Harvard University, St. John's College, Cambridge Overseas Trust, American Friends of Cambridge University and CVCP/ORS for support. TJPH is grateful to ZENECA for support.

June 2009