In this practical you will be introduced to the CATH/Gene3D websites and servers that will help you in carrying out an investigation into protein structure and function.
We will begin by looking at the structure of a specific protein (FtsA) and how it can be split into its component domains. We will then investigate the superfamily of one of these domains further by looking at the types of biological function its members perform and how the superfamily is distributed across different biological kingdoms. From this you should gain an understanding of how to carry out this type of investigation and, from the real example, how related proteins can vary substantially in structure, function and multi-domain context.
FtsA is essential for bacterial cell division and is found in the hyperthermophilic bacterium Thermotoga maritima. This tutorial will show you how to use both CATH and Gene3D to explore this protein, with the aim of understanding how evolutionary relatives are distributed across other organisms and with what biological processes they are associated. This investigation will be based around the highest resolution structure available for FtsA (PDB code 1e4f).
First of all, you need to retrieve all the domains present in FtsA. You can use the CATHEDRAL server to do this. The CATHEDRAL server employs a structural comparison algorithm to compare the query structure against known domains in the CATH database, which means you can also use it to try to identify an unknown protein by comparing it with all known structures in CATH. You submit the protein for analysis at the CATHEDRAL search page. You can download the PDB file for 1e4f from the PDB server but, in this case, as 1e4f is already known to exist in the CATH database, you can enter the code 1e4f directly into the query box. Then wait for the results to arrive and answer the questions below.
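If you do want a local copy of the PDB file, the download can be scripted. The following is a minimal sketch, not part of the CATH or CATHEDRAL tooling: the function names `pdb_url` and `download_pdb` are hypothetical, and the URL pattern is an assumption about the RCSB file server that you should check against the PDB website.

```python
import urllib.request
from pathlib import Path

# Assumed URL pattern for the RCSB PDB file server (verify on the PDB site).
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/{pdb_code}.pdb"


def pdb_url(pdb_code: str) -> str:
    """Build the (assumed) download URL for a four-character PDB code."""
    code = pdb_code.strip()
    if len(code) != 4 or not code.isalnum():
        raise ValueError(f"not a four-character PDB code: {pdb_code!r}")
    return RCSB_DOWNLOAD_URL.format(pdb_code=code.upper())


def download_pdb(pdb_code: str, directory: str = ".") -> Path:
    """Fetch the PDB file and save it as <code>.pdb in the given directory."""
    target = Path(directory) / f"{pdb_code.strip().lower()}.pdb"
    urllib.request.urlretrieve(pdb_url(pdb_code), str(target))
    return target


# Example (needs network access):
#     download_pdb("1e4f")
```

The resulting file could then be uploaded to the CATHEDRAL search page instead of typing the code into the query box.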
'CATHEDRAL Results (shortcut)'
Depending on the size of the structure that you submit and the number of other users, the CATHEDRAL server can take up to an hour to return results. The link below jumps straight to the results from a search of 1e4f:
- By examining the CATHEDRAL results, how many unique domains does the server recognise in the protein structure of FtsA?
- Which superfamilies do they belong to?
- 3.30.420.40 (2 domains), 3.30.1490.110, 3.90.640.10
For the remainder of this investigation you are going to focus on one superfamily present in FtsA, the Nucleotidyltransferase domain family (CATH code: 3.30.420.40).
- What are the unique CATH domain identifiers for the 3.30.420.40 domains of 1e4fT?
- 1e4fT03 and 1e4fT04
A Brief Explanation of PDB Codes & CATH Domain Identifiers
PDB codes consist of four letters and numbers, typically of the form '1e4f', that specify a unique structural record. Each record may consist of more than one chain, so each chain also has a code that adds an extra number or letter to the main code (e.g. 1e4fT for chain T of 1e4f). CATH domain identifiers are derived from the relevant PDB code and are based (though not strictly) on the order of the domains in the structure. So the four domains of chain 1e4fT are 1e4fT01, 1e4fT02, 1e4fT03 and 1e4fT04.
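The naming scheme above can be sketched in a few lines of Python. This is an illustration only: the `CathId` type, the `parse_identifier` function and the regular expression are hypothetical helpers that assume a one-character chain ID and a two-digit domain number, as in the examples in this practical.

```python
import re
from typing import NamedTuple, Optional


class CathId(NamedTuple):
    pdb_code: str            # four-character PDB code, e.g. "1e4f"
    chain: Optional[str]     # one-character chain ID, e.g. "T"
    domain: Optional[int]    # domain number within the chain, e.g. 4


# Assumed layout: 4-character PDB code + optional chain + optional 2-digit domain.
_ID_PATTERN = re.compile(r"^([0-9][A-Za-z0-9]{3})([A-Za-z0-9])?([0-9]{2})?$")


def parse_identifier(identifier: str) -> CathId:
    """Split an identifier such as '1e4fT04' into its PDB, chain and domain parts."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"unrecognised identifier: {identifier!r}")
    pdb_code, chain, domain = match.groups()
    return CathId(pdb_code, chain, int(domain) if domain else None)


for name in ("1e4f", "1e4fT", "1e4fT03", "1e4fT04"):
    print(name, parse_identifier(name))
```

Identifiers in real CATH releases may follow additional conventions, so treat the pattern as a starting point rather than a complete parser.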
The next step is to investigate one of the domains in the CATH superfamily 3.30.420.40 using both CATH and Gene3D.
Brief Explanation of the CATH resource
CATH is a manually-curated hierarchical classification of protein domain structures. The name CATH derives from the initials of the top four levels of the classification - Class, Architecture, Topology and Homologous Superfamily.
- Class refers to the secondary structure content (e.g. mainly-alpha, mainly-beta, mixed alpha/beta or 'few secondary structures');
- Architecture refers to the general arrangement of the secondary structures irrespective of connectivity between them (e.g. alpha/beta sandwich);
- Topology, also known as the 'fold' level, takes into account the connectivity of secondary structures in the chain;
- Homologous Superfamily refers to domains that are believed to be related by a common ancestor.
The levels below this, the S-levels, are an automated clustering based on sequence identity.
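Because each number in a CATH code corresponds to one level of the hierarchy, a code such as 3.30.420.40 can be unpacked level by level. The sketch below is illustrative only; `cath_levels` and `deepest_shared_level` are hypothetical helper names. It also shows how two codes can be compared to find the deepest level they share.

```python
from typing import Dict, Optional

LEVEL_NAMES = ("Class", "Architecture", "Topology", "Homologous Superfamily")


def cath_levels(code: str) -> Dict[str, str]:
    """Map each C, A, T, H level name to the corresponding prefix of the code."""
    parts = code.split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected a four-part CATH code, got {code!r}")
    return {name: ".".join(parts[: i + 1]) for i, name in enumerate(LEVEL_NAMES)}


def deepest_shared_level(code_a: str, code_b: str) -> Optional[str]:
    """Return the name of the deepest level two codes share, or None."""
    levels_a, levels_b = cath_levels(code_a), cath_levels(code_b)
    shared = None
    for name in LEVEL_NAMES:
        if levels_a[name] != levels_b[name]:
            break
        shared = name
    return shared


print(cath_levels("3.30.420.40"))
# The three superfamilies reported for FtsA in this practical:
print(deepest_shared_level("3.30.420.40", "3.30.1490.110"))
print(deepest_shared_level("3.30.420.40", "3.90.640.10"))
```

Comparing the FtsA superfamilies this way shows that 3.30.420.40 and 3.30.1490.110 share the same class and architecture, whereas 3.30.420.40 and 3.90.640.10 share only the class.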
To find the domain in CATH, enter the domain code (i.e. 1e4f followed by the chain ID and domain ID) into the search box (Click Here) and look at the result. If you look at the homologous superfamily level (the H-level) you will see that the domain has the code 3.30.420.40. Click on this level to view the structures of other known domains from this homologous superfamily.
- What type of fold do members of this superfamily adopt?
- How is the architecture described?
- Would you describe this as a regular architecture?
- Answers: Nucleotidyltransferase domain 5
- 2-Layer Sandwich
This domain is found in a wide variety of apparently very different proteins with differing molecular and cellular functions. Clearly this needs further investigation. Clicking on the domain IDs of the structures in this superfamily brings up more information on that particular domain. The Rasmol link will allow you to view the structure. The PDBSum link at the bottom right-hand corner of the page should provide you with a summary of ligand interactions and other information.
As you have seen so far, structural data can be a very powerful way of viewing a protein. However, the structural data set, as represented by the PDB, is very sparse; for reasons of cost and ease, there is far more sequence data available, with associated annotation. It is possible to predict structural domains in sequences by using 'profile HMMs'. For the next section of the practical you are going to see how combining structural predictions with the sequence world can provide a rich view of evolution and functional change.
Brief Explanation of Profile-HMMs
Hidden Markov Models are similar to sequence profiles (like those used in PSI-BLAST) that model the amino acid distribution of a domain superfamily. These models are generated by creating alignments of many homologues and then counting the frequency of occurrence of each amino acid in each column of the alignment (the profile). These frequencies are then used to create probabilities of occurrence against a background evolutionary model that accounts for possible substitutions. They provide a convenient and powerful way of identifying homology between sequences.
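The counting step described above can be sketched as follows. This is a simplified position-specific profile, not a full profile HMM: it has no insert or delete states and no transition probabilities. The toy alignment is invented for illustration, and the uniform background and the pseudocount of 1 are assumptions standing in for a proper evolutionary model.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_log_odds(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Per-column log2 odds of each amino acid against a uniform background."""
    if not alignment or len({len(seq) for seq in alignment}) != 1:
        raise ValueError("alignment must contain equal-length sequences")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for column in zip(*alignment):
        # Count residues in this column, ignoring gaps and unknown characters.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile: List[Dict[str, float]], sequence: str) -> float:
    """Sum the per-column log-odds for an ungapped sequence of profile length."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match the profile length")
    return sum(column.get(res, 0.0) for column, res in zip(profile, sequence))


# Invented toy alignment of four short 'homologues'.
toy_alignment = ["GDGS", "GDGT", "GEGS", "GDAS"]
profile = column_log_odds(toy_alignment)
print(score(profile, "GDGS"), score(profile, "WWWW"))
```

A sequence resembling the alignment scores above zero, while an unrelated one scores below zero, which is the basic signal a profile search uses to detect homology.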
We are now going to use the Gene3D resource to explore this superfamily further. Gene3D provides access to more functional information, as well as taxonomic distributions, multi-domain architectures and protein-protein interaction (PPI) data.
We're going to start with the 1e4fT04 domain you have been investigating. You will take other examples from the superfamily later and make observations from Gene3D.
Follow this link to Gene3D. The Gene3D front page consists of several search options for retrieving different data types (proteins, superfamily summaries, genome summaries, etc.). For now, use the protein retrieval search option by entering '1e4fT' into the search box (or follow this shortcut Click Here).
A page is returned containing a series of tabs with structural and functional information related to 1e4fT. Compare the HMM-based predictions (CATH_HMM) to the Pfam domain assignments. Pfam families are normally derived from analysis of sequences rather than structures, and so can often contain multiple structural domains that commonly co-occur. Click on the second 3.30.420.40 domain (the discontiguous one) and follow the links out to CATH and Gene3D. You should also see some functional sub-classification of this domain in the pop-up.
- From the tabs, does this protein look to be involved in the cell cycle?
- Yes there is some GO annotation from Uniprot supporting this association.
When you're ready, we can investigate other proteins from this superfamily.
Next try 1dkgD (or follow this shortcut Click Here). From the domain architecture view we can see that the CATH_HMM architecture is quite detailed and complete.
Brief Explanation of Discontinuous Domains
The general concept of a domain is a continuous sequence of amino acids in a chain. However, the rules guiding folding are more complex than that. Whilst not well understood, it has frequently been observed that the sequence encoding a domain may be 'interrupted' by one or more other domains. It is worth noting that CATH has made a significant effort to describe discontinuous domains (which occur in 5-20% of all proteins) and profile HMMs have been developed to identify them properly.
- What species is the sequence corresponding to 1dkgD found in?
- From the summary tab we can see that this sequence is found in many strains of E. coli
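One simple way to represent this idea is to store each domain as a list of residue ranges rather than a single start and end. The sketch below is illustrative: the helper names are hypothetical and the residue boundaries are invented, not taken from 1e4f or any other CATH entry.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # inclusive (start, end) residue numbers


def merge_segments(segments: List[Segment]) -> List[Segment]:
    """Sort segments and join any that overlap or directly follow one another."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if start > end:
            raise ValueError(f"segment start after end: {(start, end)}")
        if merged and start <= merged[-1][1] + 1:
            # Touching or overlapping the previous segment: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def is_discontinuous(segments: List[Segment]) -> bool:
    """A domain is discontinuous if it needs more than one residue range."""
    return len(merge_segments(segments)) > 1


# Invented boundaries: domain_a is interrupted by domain_b.
domain_a = [(1, 80), (200, 260)]
domain_b = [(81, 199)]
print(is_discontinuous(domain_a), is_discontinuous(domain_b))
```

Here `domain_a` is made of two separate stretches of the chain with `domain_b` inserted between them, which is the situation the second 3.30.420.40 domain of 1e4fT illustrates in Gene3D.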
As an example of this group, search Gene3D with the PDB code 1bdg (or follow this shortcut Click Here). The protein is a hexokinase in the parasitic worm Schistosoma mansoni. In this section you are going to search a CATH superfamily and restrict the results according to various criteria.
Looking at the domain architecture in the sequence features tab, we can see that the protein has two CATH domains. Clicking on the CATH domains shows that they have different FunFam annotations. FunFams are subdivisions of CATH superfamilies that provide functionally coherent groupings. We can retrieve proteins with similar functions (because they have the same FunFam assignments) by clicking the “Click here for functionally similar proteins” link.
- Are there functionally similar proteins in Plasmodium species?
- Yes, the “filter rows” box can be used to help in establishing this by inputting the text 'plasmodium'.
For this family you are going to look at interactions. You are going to start with the Actin-related protein (Arp) with the PDB code 1k8kA. Search for this term in Gene3D (or follow this shortcut Click Here).
- What type of cellular processes is this protein involved in?
- Cytoskeleton related processes
- Are there any drugs for this protein?
- Yes, an experimental one in DrugBank
Find the proteins that interact with this protein by clicking on the interactors tab.
You can now start to investigate the sub-processes the interactors of this protein are involved in.