CATH

README

Directory Structure

`releases/`

This directory contains all of the CATH-Gene3D database releases, from the first to the latest. All releases, including previous ones, are in releases/all-releases/ and the latest release is in releases/latest-release/.

`releases/daily-release/`

This directory provides summary information on protein domains putatively classified in CATH since the last release.

For each date with a CATH-B entry, there should be five files, e.g.:

cath-b-20170519-all.gz: combination of the 'latest release' and 'putative entries' files

cath-b-20170519-latest-release.gz: all domains that were in the latest release of CATH

cath-b-20170519-putative.gz: the domains assigned/rechopped/reassigned in CATH since the latest release

cath-b-20170519-names-all.gz: name description of each node in the CATH hierarchy. A combination of the 'latest release' and 'putative entries' files

cath-b-20170519-s35-all.gz: all domain ids in CATH-B with their S35 cluster id and domain boundary information.
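The naming pattern of the five dated files can be sketched in a few lines of Python. This is only an illustration built from the example names above; the function name `cath_b_filenames` and the constant `CATH_B_SUFFIXES` are hypothetical helpers, not part of any CATH tool.

```python
from datetime import date

# Suffixes copied from the five example file names listed above.
CATH_B_SUFFIXES = ("all", "latest-release", "putative", "names-all", "s35-all")


def cath_b_filenames(day: date) -> list[str]:
    """Return the five dated CATH-B file names expected for one day."""
    stamp = day.strftime("%Y%m%d")  # e.g. 20170519
    return [f"cath-b-{stamp}-{suffix}.gz" for suffix in CATH_B_SUFFIXES]


for name in cath_b_filenames(date(2017, 5, 19)):
    print(name)
```

This can be handy for checking that a download for a given date is complete.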

Notes:

These are compressed files; once downloaded, uncompress them with a suitable program (e.g. gunzip) before use.

Please note that the 'latest release' and 'putative' files have no domains in common.

The first three files use one line per domain, in the following format:

domain_id status putative_superfamily_id putative_chopping

The names file follows the file format described in ./README-cath-names-file-format.txt (a.k.a. CNF format).
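As a rough illustration of the one-line-per-domain format, the sketch below reads a gzipped CATH-B file into named records. It assumes the four columns are separated by whitespace and contain no embedded spaces, which this README does not state explicitly; `CathBEntry`, `parse_cath_b_line` and `read_cath_b` are hypothetical names.

```python
import gzip
from typing import Iterator, NamedTuple


class CathBEntry(NamedTuple):
    domain_id: str
    status: str
    putative_superfamily_id: str
    putative_chopping: str


def parse_cath_b_line(line: str) -> CathBEntry:
    """Split one line into its four columns (assumed whitespace-separated)."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    return CathBEntry(*fields)


def read_cath_b(path: str) -> Iterator[CathBEntry]:
    """Yield one entry per non-blank line of a gzipped CATH-B file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_cath_b_line(line)


# The tokens below are placeholders, not real CATH data.
entry = parse_cath_b_line("dom1 stat 1.10.10.10 chop\n")
print(entry.putative_superfamily_id)
```

Reading the file through `gzip.open` avoids a separate decompression step; if the real files turn out to use a different delimiter, only `parse_cath_b_line` needs to change.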

`./archive/`

This subdirectory contains all of the CATH-B files, except the files for the current day.

`./newest/`

This subdirectory contains the five CATH-B files for the current day:

cath-b-newest-all.gz
cath-b-newest-latest-release.gz
cath-b-newest-putative.gz
cath-b-newest-names-all.gz
cath-b-s35-newest.gz

`releases/latest-release/`

This subdirectory contains the latest release of CATH. Please note that these files do not contain a version number.

`releases/all-releases/`

This directory contains all of the CATH releases. Each subdirectory holds one release and is named according to its version number:

v2_0
v2_4
v2_5
v2_5_1
v2_5_3
v2_6_0
v3_0_0
v3_1_0
v3_3_0
v3_4_0
v3_5_0
v4_0_0
v4_1_0
v4_2_0
v4_3_0
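When scripting against these directory names, it can help to compare versions numerically rather than as strings, since a plain string sort would misorder a component with two digits (for example a hypothetical v4_10_0). A minimal sketch, where `version_key` is an illustrative helper:

```python
def version_key(name: str) -> tuple[int, ...]:
    """Turn a directory name such as 'v4_3_0' into (4, 3, 0)."""
    return tuple(int(part) for part in name.removeprefix("v").split("_"))


names = ["v4_3_0", "v2_5_1", "v3_5_0", "v2_0"]
print(sorted(names, key=version_key))
# ['v2_0', 'v2_5_1', 'v3_5_0', 'v4_3_0']
```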

`releases/<release-type>/<version>/cath-classification-data/`

Files within this directory contain data describing the CATH classification.

cath-chain-list-<version>.txt: Lists all of the PDB chain IDs in CATH, whether they are chopped into domains or not. For file format description see ./README-cath-list-file-format.txt (a.k.a. CLF format).

e.g. cath-chain-list-v4_3_0.txt

cath-domain-boundaries-*-<version>.txt: Description of domain and segment boundaries for domains classified into CATH. For file format description see ./README-domain-boundaries-file-format.txt (a.k.a. CDF format).

e.g. cath-domain-boundaries-v4_3_0.txt
e.g. cath-domain-boundaries-seqreschopping-v4_3_0.txt

cath-domain-description-file-<version>.txt: Description of each protein domain in CATH (see README.CDDF_FORMAT_2.0 for more details). For file format description see ./README-cath-domain-desc-file-format.txt (a.k.a. CDDF format).

e.g. cath-domain-description-file-v4_3_0.txt

cath-domain-list-*-<version>.txt: Lists of domains classified into CATH. For file format description see ./README-cath-list-file-format.txt (a.k.a. CLF format).

e.g. cath-domain-list-S35-v4_3_0.txt

cath-domain-pdb-*-<version>.tgz: The PDB files of the domains classified into CATH.

e.g. cath-domain-pdb-v4_3_0.tgz
e.g. cath-domain-pdb-S35-v4_3_0.tgz
These are gzipped tar files; once downloaded, unpack them with a suitable program (e.g. tar -xzf) before use.

cath-names-<version>.txt: Name description of each node in the CATH hierarchy, along with an example domain. For file format description see ./README-cath-names-file-format.txt (a.k.a. CNF format).

e.g. cath-names-v4_3_0.txt

cath-superfamily-list-<version>.txt: List of all the superfamilies in the CATH hierarchy.

e.g. cath-superfamily-list-v4_3_0.txt

cath-unclassified-list-<version>.txt: List of all unclassified protein chains and domains that are still being processed. For file format description see ./README-cath-list-file-format.txt (a.k.a. CLF format).

e.g. cath-unclassified-list-v4_3_0.txt

`releases/<release-type>/<version>/non-redundant-data-sets/`

The non-redundant data sets contain a non-redundant subset of CATH domains that:

* has no pair of domains (according to BLAST) with >= 20% or >= 40% sequence identity (depending on the data set chosen) over >= 60% overlap (over the longer sequence)
* is as big as we could make it otherwise.

Files

cath-dataset-nonredundant-S[20|40]-v4_3_0.atom.fa: The ATOM sequences of the domains in the dataset (which only contain residues that have ATOM records in the PDB file)

cath-dataset-nonredundant-S[20|40]-v4_3_0.fa: The sequences of the domains in the dataset

cath-dataset-nonredundant-S[20|40]-v4_3_0.list: A list of the domains in the dataset; one domain ID per line

cath-dataset-nonredundant-S[20|40]-v4_3_0.pdb.tgz: (A gzipped tar file containing) the PDB files of the domains in the data set
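The `.list` and `.pdb.tgz` files can be inspected with the Python standard library alone. The sketch below makes no assumption about how files are named inside the archive; it simply reads one domain ID per line and lists the regular files in the gzipped tar. Both function names are illustrative.

```python
import tarfile


def read_domain_ids(list_path: str) -> list[str]:
    """Read a .list file: one domain ID per line, blank lines ignored."""
    with open(list_path) as handle:
        return [line.strip() for line in handle if line.strip()]


def tgz_member_names(tgz_path: str) -> list[str]:
    """Return the names of the regular files stored in a gzipped tar file."""
    with tarfile.open(tgz_path, "r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]
```

Comparing the two results is a quick sanity check that a download is complete, provided the archive members turn out to be named after the domain IDs.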

Method of Construction

The sequence comparisons are performed with an all-against-all BLAST of our domain sequences. We then use these results to identify any links with:

* >= 40% sequence identity (i.e. pident >= 40) and
* >= 60% overlap over the longer sequence (i.e. 100.0 * length / max(slen, qlen) >= 60)
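The two link criteria can be written as a single predicate. This is a minimal sketch using the field names quoted above (`pident`, `length`, `qlen`, `slen`); the function name `is_linked` and the threshold parameters are illustrative, with the defaults set to the 40% and 60% values given in the text.

```python
def is_linked(pident: float, length: int, qlen: int, slen: int,
              min_identity: float = 40.0, min_overlap: float = 60.0) -> bool:
    """Return True if a BLAST hit meets both the identity and overlap thresholds."""
    # Overlap is measured against the longer of the two sequences.
    overlap = 100.0 * length / max(slen, qlen)
    return pident >= min_identity and overlap >= min_overlap


print(is_linked(pident=45.0, length=80, qlen=100, slen=120))  # overlap is about 66.7%
print(is_linked(pident=45.0, length=50, qlen=100, slen=120))  # overlap is about 41.7%
```

The first call reports a link and the second does not, because only the first hit covers at least 60% of the longer sequence.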

We use this to form a list of domains that contains no pair of linked entries. In an effort to make the list as large as possible, we build the list by iteratively choosing each domain to add to the list, ensuring that a domain is only added if it has as few linked neighbours as any other domain. This means the algorithm should nibble as many edges off a cluster as possible, rather than taking a small number of domains at the cluster's centre.
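One way to picture this selection is as a minimum-degree greedy heuristic on the graph of links. The sketch below is an interpretation of the description above, not the code used to build the data sets: it assumes that neighbour counts are recomputed among the remaining candidates after each pick, that a chosen domain's linked neighbours are then discarded, and that ties are broken by domain ID. The function name `greedy_nonredundant` is hypothetical.

```python
from collections import defaultdict


def greedy_nonredundant(domains, links):
    """Pick domains so that no two chosen domains are linked.

    domains: iterable of domain IDs
    links:   iterable of (id_a, id_b) pairs that are considered linked
    """
    neighbours = defaultdict(set)
    for a, b in links:
        if a != b:
            neighbours[a].add(b)
            neighbours[b].add(a)

    remaining = set(domains)
    chosen = []
    while remaining:
        # Take the candidate with the fewest linked neighbours still remaining;
        # the ID in the sort key only makes tie-breaking deterministic.
        best = min(remaining, key=lambda d: (len(neighbours[d] & remaining), d))
        chosen.append(best)
        # Drop the chosen domain and everything linked to it.
        remaining -= neighbours[best] | {best}
    return chosen


# A star-shaped cluster: "x" is linked to "p", "q" and "r".
print(greedy_nonredundant(["x", "p", "q", "r"],
                          [("x", "p"), ("x", "q"), ("x", "r")]))
# ['p', 'q', 'r']
```

In the star example the three outer domains are kept and the central one is dropped, which is the edge-nibbling behaviour described above. This simple version rescans all remaining candidates at each step, so it is quadratic in the number of domains.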

`releases/<release-type>/<version>/sequence-data/`

This directory contains protein domain sequence-based data.

cath-domain-seqs-*-<version>.fa: Sequences for each CATH domain.

e.g. cath-domain-seqs-S35-v4_3_0.fa

Hidden Markov model (HMM) libraries are provided for the S35 rep sequence clusters and the functional families (FunFams). An HMM is generated for each S35 sequence cluster and for each functional family using hmmbuild from the HMMER3 software package. All of the S35 sequence cluster HMMs and functional family HMMs are concatenated to create these two HMM library files:

cath-S35--hmm3.lib.gz
funfam-hmm3-.lib.gz

These are compressed files; once downloaded you should uncompress them with a suitable program (e.g. gunzip) before use. The program hmmpress should then be run on each of these files to construct binary compressed data files.
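The two preparation steps can be scripted as follows. This is a sketch that assumes the `hmmpress` executable from HMMER3 is installed and on the PATH; the helper names `gunzip` and `press` are illustrative.

```python
import gzip
import shutil
import subprocess
from pathlib import Path


def gunzip(gz_path: str) -> Path:
    """Decompress e.g. library.lib.gz to library.lib in the same directory."""
    src = Path(gz_path)
    dest = src.with_suffix("")  # drops only the trailing ".gz"
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest


def press(lib_path: Path) -> None:
    """Run hmmpress on an uncompressed HMM library to build its binary files."""
    subprocess.run(["hmmpress", str(lib_path)], check=True)


# Typical use, with the path of a downloaded library file:
#     press(gunzip("path/to/library.lib.gz"))
```

With `check=True`, a failing `hmmpress` run raises `subprocess.CalledProcessError` instead of passing silently.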

`./sequence-by-superfamily/`

cath-superfamily-seqs-<superfamily-id>-<version>.fa: Sequences for each CATH superfamily in FASTA format, one file per superfamily.

e.g. cath-superfamily-seqs-1.10.10.10-v4_3_0.fa

`./supplementary-files/`

This directory contains any supplementary files that are associated with a particular release.

`supplementary-materials/`

This directory contains supplementary material for published work from the group. Each subdirectory represents a different publication.

`./2015_nar_cath-funfhmmer-web-server/`

FunFHMMER-web-server-supplementary-table.xls

`./2016_ploscompbiol_functionally-classifying-and-characterising-serine-beta-lactamases/`

151-types-uniprot-cath-gene3d.dat
SSPA-mutant-positions-extended-spectrum-resistance.dat

FILE BROWSER

Note: this download area can be accessed directly via https://download.cathdb.info/cath
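For scripted downloads, a URL can be assembled from the base address above and the directory layout described in this README. The sketch assumes that URL paths mirror the directory names exactly, which has not been verified here; `classification_file_url` is a hypothetical helper.

```python
BASE_URL = "https://download.cathdb.info/cath"


def classification_file_url(release_type: str, version: str, filename: str) -> str:
    """Build a URL under releases/<release-type>/<version>/cath-classification-data/."""
    return (f"{BASE_URL}/releases/{release_type}/{version}"
            f"/cath-classification-data/{filename}")


print(classification_file_url("all-releases", "v4_3_0", "cath-names-v4_3_0.txt"))
```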