The following email exchange may help others understand how CATH codes are numbered at the sequence-cluster (SOLID) levels.
Received on Aug 21, 2008 (reproduced with permission)
Hi, CATH team:

Upon reading the original paper published in 1997 and visiting your website, I am still confused about the definition of CATH code in sequence family levels. Take the CATH 3.1 reflected by these three below proteins as a example, my questions of them were as following:

1) the code in CATHSOLI level of 2a8vA01 and 1a8vA01 are all the same, but why their codes in D level were different? Does 2a8vA01 and 1a8vA01 not belong to the same s100 family?

2) the code in CATHSO ID level of 1a8vA01 and 1a62001 are all the same, but why their codes in L level were different? If 1a8vA01 and 1a62001 are 100% sequence identical, why they were assigned to different 95% sequence group?

Sincerely.
backy

                             35%  60%  95% 100%
           C    A    T    H    S    O    L    I    D
2a8vA01    1   10  720   10    2    1    2    1    3   47  2.400
1a8vA01    1   10  720   10    2    1    2    1    1   49  2.000
1a62001    1   10  720   10    2    1    1    1    1   44  1.550
The reply on Aug 21, 2008
Hi Backy,

Thanks for getting in touch with us, hopefully I can answer your questions below:

> 1) the code in CATHSOLI level of 2a8vA01 and 1a8vA01 are all the same, but why their codes in D level were different?

The D level stands for "Domain Count" and is just there to provide a unique code for every domain - so if two domains are identical (i.e. they share everything up to the I, or 100% Identical, code) then we use the D level to differentiate between them - this is just a sequential counter.

> Does 2a8vA01 and 1a8vA01 not belong to the same s100 family?

Yes, they do - they share up to the I count so they are 100% identical - as mentioned above - the domain level is just a counter to differentiate between domains in the same I cluster.

> 2) the code in CATHSO ID level of 1a8vA01 and 1a62001 are all the same, but why their codes in L level were different?

You need to bear in mind that CATH is a tree-like hierarchy with the trunk of the tree represented on the left of the CATHSOLID classification (e.g. the C code) and the leaves of the tree on the right (e.g. the D code). In the example you give above - you have to read the CATH codes from left to right and stop the first time one of the codes differs. In this case, they differ at the 'L' code so they are in different S95% clusters. It doesn't matter that the numbers after this (I, D) are the same as they are talking about different branches of the tree.

> If 1a8vA01 and 1a62001 are 100% sequence identical, why they were assigned to different 95% sequence group?

The simple answer is that they aren't 100% identical - they have a seq id of 94.7% so they are in different L codes. As mentioned above, the I and D happen to be the same, but that doesn't mean anything if the L code is different (CATHSOLID needs to be read from left to right).
So for the following three domains:

2a8vA01  1.10.720.10.2.1.2.1.3
1a8vA01  1.10.720.10.2.1.2.1.1
1a62001  1.10.720.10.2.1.1.1.1

The tree/hierarchy would look something like:

C                     1
A                    10
T                   720
H                    10
S                     2
O                     1
                ______|______
L               1           2
I               1           1
                |       ____|____
D               1       1       3
             1a62001 1a8vA01 2a8vA01

This seems like a good question/answer to add to our FAQ section of the website - would you mind?

Best wishes,
Ian Sillitoe
CATH Team
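
The left-to-right reading described in the reply can be sketched in a few lines of Python. This is only an illustration, not part of any CATH software: the function name `first_difference` and the `LEVELS` string are assumptions made for the example, and the three codes are assembled from the table in the original question.

```python
# Illustrative sketch only; not taken from any CATH tool or API.
# One letter per level, from the trunk (C) to the leaf (D).
LEVELS = "CATHSOLID"


def first_difference(code_a, code_b):
    """Return the letter of the first level where two codes differ.

    Codes are read left to right; everything after the first
    difference belongs to a different branch and is ignored.
    Returns None if the two codes are identical.
    """
    parts_a = code_a.split(".")
    parts_b = code_b.split(".")
    if len(parts_a) != len(LEVELS) or len(parts_b) != len(LEVELS):
        raise ValueError("expected nine dot-separated numbers")
    for level, a, b in zip(LEVELS, parts_a, parts_b):
        if a != b:
            return level
    return None


# Codes built from the table in the question above.
domains = {
    "2a8vA01": "1.10.720.10.2.1.2.1.3",
    "1a8vA01": "1.10.720.10.2.1.2.1.1",
    "1a62001": "1.10.720.10.2.1.1.1.1",
}

# Same I cluster (100% identical), told apart only by the D counter.
print(first_difference(domains["2a8vA01"], domains["1a8vA01"]))  # D

# Different L clusters; the matching I and D numbers mean nothing here.
print(first_difference(domains["1a8vA01"], domains["1a62001"]))  # L
```

The function stops at the first mismatch on purpose: matching numbers further to the right, such as the identical I and D values of 1a8vA01 and 1a62001, sit on different branches and say nothing about similarity.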