A very simple description
Given:
This program:
More detail/background
Proteins often belong to large homologous families. While most proteins in such a family have similar molecular functions, they can differ in important details such as enzymatic substrate specificity or preferred interacting partners. In such cases it is important to understand how the family has evolved to perform these different functions. For "orphan" sequences, it is also very useful to be able to predict which of the sometimes many different functions performed by a family the sequence is most likely to perform.
PROUST-II uses HMMer hidden Markov model profiles to do just this. A very detailed description can be found in the reference below (Hannenhalli & Russell, 2000). Essentially, the method subtracts the profile for sequences that are not in a group from the profile for the sequences in the group. This leaves only those parts of the profile that are unique to the group. After identifying the positions that are most likely to confer the sub-types, it uses them to predict sub-types for those sequences that are in the alignment but are not in any group ("orphans").
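To make the idea of comparing an in-group profile with an out-of-group profile more concrete, here is a minimal Python sketch. It is not the PROUST-II implementation and does not use HMMer profiles: as a simplified stand-in, it builds plain per-column amino-acid frequencies (with a pseudocount) and ranks columns by the relative entropy between the two groups. The function names, the pseudocount value and the toy alignment are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(residues, pseudocount=1.0):
    """Return smoothed amino-acid frequencies for one alignment column."""
    # Gap characters and any non-standard symbols are simply ignored.
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def discriminating_columns(in_group, out_group):
    """Rank columns by relative entropy of in-group vs. out-of-group frequencies.

    Both arguments are lists of aligned sequences of equal length.
    Returns (column_index, score) pairs, highest score first (0-based columns).
    """
    scores = []
    for col in range(len(in_group[0])):
        p = column_frequencies([s[col] for s in in_group])
        q = column_frequencies([s[col] for s in out_group])
        score = sum(p[a] * math.log(p[a] / q[a]) for a in AMINO_ACIDS)
        scores.append((col, score))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical toy alignment: columns 1 and 2 differ between the groups,
# columns 0 and 3 have the same composition in both.
adc = ["ADKT", "ADKS", "ADKT"]
guc = ["AERT", "AERS", "AERT"]
ranked = discriminating_columns(adc, guc)
```

In this toy case the two columns that differ between the groups get a positive score and rank first, while the columns with identical composition score zero. The real method works on hidden Markov model profiles, so treat this only as an intuition aid.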
There is no need to specify the format in advance, as the program will guess. However, if you get a message saying that it can't read the format (i.e. PROUST-II can't guess the correct format), you may need to specify it explicitly.
A groups file must contain a line for each identifier (e.g. 100K_RAT, P123456, gi1234323, etc.)
from the alignment that is known to be in a group. Each line must contain at least
two fields, one with the identifier and one with the group. By default, the first
field is expected to be the identifier, and the second the group, but you can change
this if you like on the submit form. For example:
CYAB_STIAU/9-194 ADC
CYAG_DICDI/387-572 ADC
CYAG_DICDI/655-825 ADC
CYG2_RAT/399-582 GUC
CYG3_BOVIN/473-662 GUC
CYG3_CAEEL/889-1077 GUC
So here there are two groups, ADC (= adenylate cyclase) and GUC (= guanylate cyclase), each
with three identifiers.
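If you want to check a groups file before submitting it, a short Python sketch along these lines may help. It assumes the fields are separated by whitespace, as in the example above, and it uses 0-based field positions; the function names and parameters are hypothetical and are not part of PROUST-II. It also lists the "orphans", i.e. identifiers that are in the alignment but not in any group.

```python
def parse_groups(lines, id_field=0, group_field=1):
    """Map each identifier to its group from whitespace-separated lines.

    id_field and group_field are 0-based positions (an assumption of this
    sketch; by default the identifier comes first and the group second).
    """
    groups = {}
    for line in lines:
        fields = line.split()
        if len(fields) <= max(id_field, group_field):
            continue  # skip blank lines and lines with too few fields
        groups[fields[id_field]] = fields[group_field]
    return groups


def find_orphans(alignment_ids, groups):
    """Return identifiers from the alignment that are not assigned to a group."""
    return [ident for ident in alignment_ids if ident not in groups]


example = """\
CYAB_STIAU/9-194 ADC
CYAG_DICDI/387-572 ADC
CYAG_DICDI/655-825 ADC
CYG2_RAT/399-582 GUC
CYG3_BOVIN/473-662 GUC
CYG3_CAEEL/889-1077 GUC
""".splitlines()

groups = parse_groups(example)

# Pretend the alignment holds the six grouped sequences plus one ungrouped one.
alignment_ids = list(groups) + ["AAK45954"]
orphans = find_orphans(alignment_ids, groups)
```

Here `groups` holds six identifiers split evenly between ADC and GUC, and `orphans` contains only the ungrouped identifier.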
Let's have a look at how PROUST-II works. Imagine that you are presented with this sequence:
>AAK45954 purine cyclase-related protein [Mycobacterium tuberculosis CDC1551]
MRLVPQTPRSSLPGSARTTYPCHVEVGPQDSESGAPDETATAMASPVPRQRSALRWLRTVNRSPGLVSFI
HRARRLLPGDPEFGDPLSTAGEGGPRAAARAADRLLRDRDAASREVGLSVLQVWQALTEAVSRRPANPEV
TLVFTDLVGFSTWSLHAGDDATLTLLRQVARAVESPLLDAGGHIVKRLGDGIMAVFRNPTVALRAVLVAQ
DAVKSLEVQGYTPRMRIGIHTGRPQRLAADWLGVDVNIAARVMERATKGGIMISQPTLDLIPQSELDALG
VVARRVRKPVFASKPTGIPPDLAIYRIKTVSESTAADNFDEMSPDAQ
Say also that you manage to find out (e.g. by BLAST) that it belongs to a family of proteins that contains adenylate and guanylate cyclases, and you get an alignment of this sequence to this family (e.g. see the file cyclase.aln; adapted from an alignment from Pfam). Moreover, you know that this family performs a cyclisation reaction that takes either ATP or GTP and converts it to cAMP or cGMP, and you have also assigned a set of other sequences to their groups (e.g. see the file cyclase.gr). The "either" is important here, as this is what we are going to try to predict using PROUST-II. Note that this was probably also a problem for the genome annotators: the sequence is called a "purine" (i.e. adenylate or guanylate) cyclase. I suspect that this is because the matches (with BLAST) to the other members of the family are quite weak, making it difficult to state whether it is ATP or GTP specific by looking, for example, at BLAST E-values.
So, we would like to know first of all what determines whether a cyclase is ATP or GTP specific, and secondly what is the likely specificity of this new protein. Try running the program with these files, or just trust me that what you get (or should get) with the defaults will look something like what appears below.
Sequence | Prediction | Confidence | Pred score | Individual scores |
AAK45954 | | LOW | -4.69 | GUC -21.2 ADC -11.6 |