This page answers the questions we receive most frequently at CATH and is the best place to start looking if you have a question about any aspect of the CATH resource.
Please note that these documentation pages are still in their infancy, so some questions do not yet have answers. Where a question is listed without an answer, we know it is important and will document the answer as soon as we can.
The CATH database is a hierarchical domain classification of protein structures in the Protein Data Bank. Protein structures are classified using a combination of automated and manual procedures. There are four major levels in this hierarchy:
- Class - structures are classified according to their secondary structure composition (mostly alpha, mostly beta, mixed alpha/beta or few secondary structures).
- Architecture - structures are classified according to their overall shape, as determined by the orientations of the secondary structures in 3D space, ignoring the connectivity between them.
- Topology (fold family) - structures are grouped into fold groups at this level depending on both the overall shape and connectivity of the secondary structures.
- Homologous superfamily - this level groups together protein domains which are thought to share a common ancestor and can therefore be described as homologous.
For any given structure classified in the database, CATH gives you information on the structure and function of that protein. The evolutionary relationships involving the structure of interest and other proteins in the database can also be determined.
CATH also gives an overall view of the known protein structure universe to date. You can find which folds and superfamilies are the most populated, for example, and which structures are rare in nature.
Maintaining the CATH database is very much a team effort. Most of the members of the Orengo group have helped with the manual curation of the database and some have developed algorithms to aid with the automated aspects of maintaining and updating it.
Ian Sillitoe is the CATH Manager. Nicola Bordin is a Research Associate involved in algorithm development for the generation of Functional Families and their applications. Natalie Dawson was the CATH curator and a Research Associate in the group. Vaishali Waman is a Research Associate in the team and is now the CATH curator.
CATH is a tree-like, hierarchical classification that starts off at the tree “trunk” by clustering protein domains into broad categories (e.g. C, or class, where domains are clustered solely based on their general secondary structure content). As the hierarchy moves away from the “trunk” to the “branches”, more stringent clustering criteria are applied to provide clusters of domains with finer granularity of similarity.
| Level | Letter | Name | Clustering criterion |
|---|---|---|---|
| 1 | C | Class | Secondary structure content |
| 2 | A | Architecture | General spatial arrangement of secondary structures |
| 3 | T | Topology | Spatial arrangement and connectivity of secondary structures (fold) |
| 4 | H | Homologous Superfamily | Manual curation of evidence of evolutionary relationship (at least two criteria from sequence/structure/function must be observed) |
| 5 | S | Sequence Family (S35) | >= 35% sequence identity |
| 6 | O | Orthologous Family (S60) * | >= 60% sequence identity |
| 7 | L | “Like” domain (S95) * | >= 95% sequence identity |
| 8 | I | Identical domain (S100) | 100% sequence identity |
| 9 | D | Domain counter | Unique domains |
* We are aware that the names “Orthologous” and “Like” are by no means perfect descriptions of the clustering criteria that they represent. However, we find it useful to provide some kind of label for these clusters and (quite frankly) these are the best we could come up with.
[from Ivan Kon on 20/10/2008]
CATH is a hierarchical classification that clusters protein structures at differing levels of similarity. The first level, Class, clusters proteins based on their general secondary structure content and is represented by the first number in the CATH code (the 'C' row in the table above).
A more detailed explanation of the numbering used in the sequence clusters (SOLID levels) can be found in this blog entry.
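The correspondence between the numbers in a CATH code and the levels described above can be sketched in Python. This is a minimal illustration, not an official CATH tool: it assumes the code is written as dot-separated integers, the function name `parse_cath_code` is hypothetical, and `3.40.50.300` is used purely as an example value.

```python
# Level letters and names, in hierarchy order (C, A, T, H, then S, O, L, I, D).
LEVELS = [
    ("C", "Class"),
    ("A", "Architecture"),
    ("T", "Topology"),
    ("H", "Homologous Superfamily"),
    ("S", "Sequence Family (S35)"),
    ("O", "Orthologous Family (S60)"),
    ("L", "Like domain (S95)"),
    ("I", "Identical domain (S100)"),
    ("D", "Domain counter"),
]


def parse_cath_code(code):
    """Map each number in a CATH code to its level letter.

    Assumption: numbers are separated by dots and a code may stop at
    any depth of the hierarchy (for example after the H level).
    """
    parts = code.split(".")
    if not 1 <= len(parts) <= len(LEVELS):
        raise ValueError(f"expected 1 to {len(LEVELS)} numbers, got {len(parts)}")
    if not all(p.isdigit() for p in parts):
        raise ValueError(f"non-numeric field in CATH code: {code!r}")
    # zip stops at the shorter sequence, so a partial code fills only
    # the levels it actually specifies.
    return {letter: int(p) for (letter, _name), p in zip(LEVELS, parts)}


print(parse_cath_code("3.40.50.300"))
```

A four-number code yields entries for C, A, T and H only; longer codes extend into the SOLID levels in the same order.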
For a particular CATH version, for example 3.2.0, the first number indicates the most recent major CATH database release (i.e. version 3.0.0), whilst the second number indicates a minor release. Version 3.2.0 is therefore the second update of the major CATH release 3.0.0. The third number is used for internal purposes.
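As a rough sketch of how a version string breaks down under this scheme, the following Python splits it into its three numbers. The names `CathVersion`, `parse_cath_version` and `major_release` are hypothetical and exist only for this illustration.

```python
from typing import NamedTuple


class CathVersion(NamedTuple):
    major: int     # most recent major release
    minor: int     # minor release (update number) within that major release
    internal: int  # used for internal purposes


def parse_cath_version(version):
    """Split a version string such as '3.2.0' into its three numbers."""
    major, minor, internal = (int(x) for x in version.split("."))
    return CathVersion(major, minor, internal)


def major_release(v):
    """Return the major release that a version is an update of."""
    return f"{v.major}.0.0"


v = parse_cath_version("3.2.0")
print(f"update {v.minor} of major release {major_release(v)}")
```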
A domain identifier is assigned to every classified domain in the CATH database. It consists of a 4-character PDB code, for example 1kcm, followed by the chain name, denoted by a letter, and a two-digit domain number. If there is only one chain, it will be assigned the letter A in the same way as the first chain in a multi-chain structure. If there is only one domain in the chain then 00 is used for the domain number. The structure 1kcm has only a single domain in a single chain; the domain identifier will therefore be 1kcmA00.
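The identifier layout described above can be sketched as a small parser. This is an illustrative sketch only: the pattern follows the description in this section (a 4-character PDB code, a single-letter chain name and a two-digit domain number), and the function names are hypothetical.

```python
import re

# 4-character PDB code + one chain letter + two-digit domain number.
# Assumption: the chain is a single letter, as described in the text.
DOMAIN_ID = re.compile(
    r"^(?P<pdb>[A-Za-z0-9]{4})(?P<chain>[A-Za-z])(?P<domain>[0-9]{2})$"
)


def parse_domain_id(domain_id):
    """Split a domain identifier into (pdb_code, chain, domain_number)."""
    m = DOMAIN_ID.match(domain_id)
    if m is None:
        raise ValueError(f"not a valid domain identifier: {domain_id!r}")
    return m["pdb"], m["chain"], int(m["domain"])


def format_domain_id(pdb_code, chain, domain_number):
    """Build a domain identifier; 00 denotes the only domain in a chain."""
    return f"{pdb_code}{chain}{domain_number:02d}"


print(parse_domain_id("1kcmA00"))
print(format_domain_id("1kcm", "A", 0))
```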
The two-digit domain number was implemented because of the emergence of protein structures with more than nine domains. As experimental techniques for solving crystal structures have improved, more protein structures with a large number of separate domains have been determined.
This was due to the wwPDB remediation project. Please click here for further information.
FunFams now have a more consistent numbering scheme based on the number of sequences contained in the 'seed' alignments at their time of generation. FunFam 1 has the highest number of sequences, FunFam 2 the second-highest, and so on.
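The numbering rule can be illustrated with a short Python sketch. It is not the CATH pipeline: the function name `number_funfams`, the cluster labels and the seed sizes are all made up for the example, and ties are simply left in input order because the text does not say how they are resolved.

```python
def number_funfams(seed_sizes):
    """Assign FunFam numbers by decreasing seed alignment size.

    seed_sizes maps a (hypothetical) cluster label to the number of
    sequences in its seed alignment at generation time.
    """
    # Largest seed alignment first; sorted() is stable, so clusters of
    # equal size keep their input order (an assumption, see above).
    ranked = sorted(seed_sizes, key=lambda label: seed_sizes[label], reverse=True)
    return {label: number for number, label in enumerate(ranked, start=1)}


# Made-up seed alignment sizes for three clusters.
example = {"cluster_a": 12, "cluster_b": 87, "cluster_c": 40}
print(number_funfams(example))
```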
A tutorial on how to search CATH can be found here.
The answer is to use the CATH webservices. However, the webservices are undergoing a major revamp and are still in testing. We will update this section when we move them to production.
If you would like us to link to your resource and there is a natural mapping from one of the CATH entities (PDB, PDB Chain, Domain, Classification, etc.), then please get in touch.