Which PAM matrix to use

They seem quite similar: both contain one "indel" and one substitution, just at different positions. However, if we think of the letters as amino acid residues rather than elements of strings, alignment a is the better one, because isoleucine (I) and leucine (L) have similar side chains, while tryptophan (W) has a very different structure. This is a physico-chemical measure; these days we might prefer to say that leucine simply substitutes for isoleucine more frequently, without giving an underlying "reason" for this observation.

However we explain it, it is much more likely that a mutation changed I into L and that W was lost, as in a, than that W changed into L and I was lost. We would expect a change from I to L to affect the function less than a mutation from W to L, but this deserves its own topic. To quantify the similarity achieved by an alignment, scoring matrices are used: they contain a value for each possible substitution, and the alignment score is the sum of the matrix's entries for each aligned amino acid pair.

For gaps (indels), a special gap score is necessary; a very simple one is to add a constant penalty for each indel. The optimal alignment is the one that maximizes the alignment score. PAM matrices are a common family of score matrices. PAM stands for Percent Accepted Mutations (the name is also commonly expanded as Point Accepted Mutation), where "accepted" means that the mutation has been adopted by the sequence in question. Thus, a PAM matrix with a large number assumes that roughly that many mutations per 100 amino acids may have happened, while with PAM 10 only 10 mutations per 100 amino acids are assumed, so that only very similar sequences will reach useful alignment scores.
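As a minimal sketch of the scoring scheme described above, the following Python snippet sums substitution scores over the aligned pairs and adds a constant penalty per indel. The example sequences, the values in `TOY_SCORES`, and the `GAP_PENALTY` of -8 are assumptions chosen for illustration; they are not taken from a published PAM matrix.

```python
# Toy substitution scores (illustrative values only, not a published matrix).
TOY_SCORES = {
    ("A", "A"): 2, ("I", "I"): 5, ("L", "L"): 6, ("W", "W"): 17,
    ("I", "L"): 2, ("L", "W"): -2, ("I", "W"): -5,
    ("A", "I"): -1, ("A", "L"): -2, ("A", "W"): -6,
}
GAP_PENALTY = -8  # assumed constant penalty for every indel position


def pair_score(x, y):
    """Look up a substitution score; the toy matrix is symmetric."""
    if (x, y) in TOY_SCORES:
        return TOY_SCORES[(x, y)]
    return TOY_SCORES[(y, x)]


def alignment_score(row1, row2, gap="-"):
    """Sum matrix entries over aligned pairs, plus a penalty per indel."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row1, row2):
        if x == gap or y == gap:
            total += GAP_PENALTY
        else:
            total += pair_score(x, y)
    return total


# Hypothetical alignments: I->L with W lost, versus W->L with I lost.
score_a = alignment_score("AIWA", "AL-A")
score_b = alignment_score("AIWA", "A-LA")
print(score_a, score_b)
```

With these assumed values the alignment that pairs I with L scores higher than the one that pairs W with L, which is the preference argued for in the text.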

PAM matrices contain positive and negative values: if the alignment score is greater than zero, the sequences are considered to be related (they are similar with respect to the scoring matrix used); if the score is negative, it is assumed that they are not related. Finally, it should be noted that only some scoring matrices use similarity to evaluate alignments, while others use distance, so be careful when interpreting the results!

A single pair of sequences does not contain enough information to allow us to determine a scoring scheme. Therefore, we need to compare multiple sequences, but unfortunately the construction of multiple sequence alignments is a computationally hard problem that cannot in practice be solved optimally. In an attempt to reduce errors, the estimation of the frequencies is based on closely related sequences and on alignments with no insertions or deletions, which results in scoring systems without gap scores.

The selection of a scoring matrix depends on our goal: whether we are searching a database, or aligning known sequences and wish to maximize the alignment accuracy. In database searches, the primary concern is to find matches that are statistically significant, and thus to discriminate true matches from chance.

Once we have identified a correct sequence family, we should build a custom scoring matrix from the information available in the multiple sequences of that family, instead of using a general one, to fine-tune alignments or to search for increasingly distant homologs.

However, the algorithm is sensitive to each sequence that is included, so one must be careful not to include incorrect sequences, which may result in a blend of families. Besides, the resulting scoring depends on the order in which sequences are added to the alignments. Please see the related subsection below. In summary, the resulting alignments depend on the algorithm (global or local alignment), the scoring scheme, the evolutionary distance of the aligned sequences, and the gap penalty scheme.

With increasing evolutionary distance, the available information decreases, and consequently increasingly long sequence alignments are required to collect enough information for the alignments to be distinguishable from random alignments.

Note that the calculation assumes alignments with no gaps and that the length of similar blocks between sequences tends to decrease with increasing divergence. With gaps in the alignments, the minimum lengths would increase. Consequently, finding significant distant homologs is a more challenging task than finding closely related sequences. Every scoring scheme is either based on some overall percentage of sequence similarity or implies one.

With increasing evolutionary distance or divergence, the frequency of matching residues decreases, and vice versa. The reason why L-L matches score lower than W-W matches is that leucine is more abundant than tryptophan; consequently, the chance of randomly getting a W-W pairing is lower than that of getting an L-L pairing.

Furthermore, when the base of the logarithm is two, the scores are in bits. In general, as is also the case in BLOSUM matrices, the scores are further scaled to represent multiples of half bits. The examples are just a generalization of the concept. However, any scoring scheme, whether based on sets of real sequences or on theoretical reasoning, implies a specific target frequency.
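To make the bits and half-bits remark concrete, here is a small sketch that assumes the usual log-odds construction: a score is the base-2 logarithm of how often a pair is observed in true alignments divided by how often it would occur by chance. The frequencies in `BACKGROUND` and `TARGET` are assumed, illustrative numbers, and the function names are hypothetical.

```python
import math

# Assumed background frequencies of leucine (abundant) and tryptophan (rare).
BACKGROUND = {"L": 0.099, "W": 0.013}
# Assumed frequencies of the aligned pairs in trusted alignments.
TARGET = {("L", "L"): 0.0371, ("W", "W"): 0.0065}


def log_odds_bits(a, b):
    """log2 of the observed pair frequency over the chance expectation."""
    expected_by_chance = BACKGROUND[a] * BACKGROUND[b]
    return math.log2(TARGET[(a, b)] / expected_by_chance)


def half_bit_score(a, b):
    """Scale to half bits (multiply by two) and round to an integer."""
    return round(2 * log_odds_bits(a, b))


for residue in ("L", "W"):
    print(residue, round(log_odds_bits(residue, residue), 2),
          half_bit_score(residue, residue))
```

Because a chance W-W pairing is so much rarer than a chance L-L pairing, the W-W score comes out clearly higher under these assumptions.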

So what is the effect of using a scoring matrix optimized for one similarity level on sequences at another? Theoretically, the efficiency of scoring matrices decreases with increasing distance from the optimal similarity. We can observe that, as we deviate from the target frequency, the minimum alignment length needed to attain statistical significance increases, which is an important observation. Regardless of which substitution matrix we use, modern search tools efficiently prevent us from getting false matches through their statistics. On the other hand, we may miss relevant matches to short, well-conserved domains.
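The loss of efficiency can be sketched for a simple DNA match/mismatch scheme. The snippet below assumes uniform base composition (each base at 25%) and equally likely mismatches, and derives log-odds scores from a target identity. The function names are hypothetical, and the model is a simplification rather than the exact calculation behind the figures.

```python
import math


def bits_per_position(target_identity, actual_identity):
    """Expected score per aligned position, in bits.

    The scores are log-odds values tuned for `target_identity`; the
    sequences actually being compared share `actual_identity`.
    """
    t, a = target_identity, actual_identity
    if not (0 < t < 1 and 0.25 < a < 1):
        raise ValueError("need 0 < target < 1 and 0.25 < actual < 1")
    match = math.log2(4 * t)               # pair frequency t/4 vs chance 1/16
    mismatch = math.log2(4 * (1 - t) / 3)  # pair frequency (1-t)/12 vs 1/16
    return a * match + (1 - a) * mismatch


def efficiency(target_identity, actual_identity):
    """Information obtained relative to the best achievable matrix."""
    optimum = bits_per_position(actual_identity, actual_identity)
    return bits_per_position(target_identity, actual_identity) / optimum


for target in (0.6, 0.75, 0.9, 0.99):
    print(target, round(efficiency(target, 0.9), 3))
```

Under this model the efficiency equals one only when the target identity matches the actual identity, and it drops on either side of it.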

Scoring matrices for protein sequences behave the same way as DNA scoring matrices. Their efficiency also decreases with increasing divergence from the optimal target frequency. Figure 3 shows the efficacy of a selection of five PAM matrices with an underlying frequency deviating from their target frequency.

As with DNA and PAM matrices, the maximum efficiency is one when the target frequency matches the actual underlying frequency. Information content varies among scoring matrices. This information content, usually given as bits per position, decreases with decreasing target similarity and with increasing deviation from the optimal target frequency (Figures 2, 3, and 4). The efficiency is one when the target frequency equals the actual distance (Figure 4). Scoring with low-information-content matrices requires longer alignments than scoring with high-information-content matrices for an alignment to be statistically significant.

If we were to use VTML10 matrix instead, which gives 3. All this is essential knowledge since although statistics most of the time prevent us from getting false hits, we may miss short conserved domains when using matrices that target low similarities.
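The link between information content and minimum alignment length can be sketched as follows. The bit-score threshold uses the standard relation between E-value and bit score, E = m * n * 2^(-S), and the minimum length is simply the required bits divided by the bits each position contributes. The search-space sizes and the bits-per-position values below are assumptions for illustration, not values from the figures.

```python
import math


def required_bit_score(query_length, database_length, e_value):
    """Bit score S needed so that E = m * n * 2**(-S) equals `e_value`."""
    return math.log2(query_length * database_length / e_value)


def min_alignment_length(required_bits, bits_per_position):
    """Shortest ungapped alignment that can accumulate `required_bits`."""
    if bits_per_position <= 0:
        raise ValueError("bits_per_position must be positive")
    return math.ceil(required_bits / bits_per_position)


# Hypothetical search: a 1024-residue query against 1024 * 1024 residues.
needed = required_bit_score(1024, 1024 * 1024, e_value=1.0)
print(needed)
print(min_alignment_length(needed, 0.5))  # low-information matrix (assumed)
print(min_alignment_length(needed, 3.0))  # high-information matrix (assumed)
```

A matrix that yields few bits per position needs a much longer alignment to reach the same threshold, which is why short conserved domains can be missed.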

Note that these values apply to scores using a matrix whose target frequency matches a real underlying frequency. By deviating from the optimal target frequency, the minimum statistically significant alignment lengths increase accordingly. DNA sequences are often more divergent than protein sequences at an equal evolutionary distance because codons are redundant, primarily at the third codon position, which is the most wobbly one.

A codon consists of three bases, so there are 64 possible codons, but only 20 amino acids are in use; most amino acids are therefore encoded by more than one codon, and three codons are stop signals. Methionine, whose codon also signals the start, is one of the few amino acids encoded by a single codon. Therefore, many mutations in DNA leave the encoded amino acid unchanged; such mutations are synonymous, or silent, mutations.
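The codon redundancy can be checked directly with the standard genetic code. The sketch below builds the 64-codon table from the conventional TCAG ordering and classifies a single codon change as synonymous or as non-synonymous (one that changes the amino acid). The helper name `classify_mutation` is hypothetical.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_mutation(codon_before, codon_after):
    """Return 'synonymous' if both codons encode the same residue."""
    same = CODON_TABLE[codon_before] == CODON_TABLE[codon_after]
    return "synonymous" if same else "non-synonymous"


codons_per_residue = Counter(CODON_TABLE.values())
single_codon = sorted(aa for aa, n in codons_per_residue.items() if n == 1)

print(len(CODON_TABLE), codons_per_residue["*"], single_codon)
print(classify_mutation("CTT", "CTC"))  # third-position change in leucine
print(classify_mutation("ATG", "ATA"))  # methionine to isoleucine
```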

Mutations that change a codon so that it codes for a different amino acid are non-synonymous. In a residue-by-residue comparison, DNA contains less information than protein.

One DNA base can contain up to two bits, and an amino acid up to a little over four. Should we then compare DNA instead of protein sequences? Well, it depends on what our goals with the comparison are. Firstly, if we translate DNA into protein, we again end up with only a little over four bits per residue. Secondly, to capture the whole DNA information content, a sequence alignment algorithm needs to account for the triplets rather than align base by base. However, one can question how useful this is, since many of the changes in DNA do not result in a different amino acid.
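The upper bounds quoted above follow from the alphabet sizes: a symbol drawn from N equally likely letters carries log2(N) bits. A short sketch, with hypothetical function names, that also shows how uneven letter frequencies reduce the content below that maximum:

```python
import math


def max_bits(alphabet_size):
    """Upper bound on information per symbol for a given alphabet size."""
    return math.log2(alphabet_size)


def shannon_entropy(frequencies):
    """Information per symbol, in bits, for the given letter frequencies."""
    return -sum(p * math.log2(p) for p in frequencies if p > 0)


print(max_bits(4))                # one DNA base
print(round(max_bits(20), 2))     # one amino acid
print(3 * max_bits(4))            # one codon read as three bases
# Assumed skewed base composition: less than the two-bit maximum.
print(round(shannon_entropy([0.4, 0.1, 0.1, 0.4]), 2))
```

Read as a triplet, a codon could carry up to six bits, more than a single amino acid, which is the motivation for the second point in the text.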

On the other hand, just by counting differences between DNA sequences, we can estimate evolutionary distances. The scores and information content of the matrices are derived from sequence alignments with no gaps, and thus the inclusion of gaps in alignments decreases the information content. There is no theoretical mathematical model for gapped alignments at various evolutionary distances, because estimating background frequencies of gaps is challenging and, furthermore, these background frequencies are likely to change with different gap penalty values. Consequently, gap penalties need to be determined by simulations and empirical studies.

The setting of gap penalties is probably most crucial at short evolutionary distances, and low gap penalties are inefficient in identifying highly similar sequences, just as is the case with scoring matrices themselves.

However, while we want alignments to contain as few gaps as possible, we also want them to be long enough to achieve statistical significance, which presents something of a dilemma. The rule of thumb for gap penalties follows the same logic as for scoring matrices, where highly similar sequences call for high substitution penalties and highly divergent sequences for low ones.

So, when searching for highly similar sequences the gap penalties should be high, and for divergent sequences low. Decreasing the gap penalties tends to increase the alignment lengths. Reese and Pearson performed simulations to maximize the performance of short alignments and suggest that, for half-bit scoring matrices, a gap open penalty of 15 and a gap extension penalty of 2 are the most efficient for VTML20.
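An open and an extension penalty together form an affine gap cost. The sketch below assumes the convention in which a gap of k residues costs the open penalty plus k times the extension penalty. Programs differ on whether the open penalty already covers the first residue, so treat the exact totals as dependent on that assumption.

```python
def affine_gap_cost(gap_length, gap_open=15, gap_extend=2):
    """Cost of one gap of `gap_length` residues under the assumed convention.

    cost = gap_open + gap_extend * gap_length
    """
    if gap_length < 1:
        raise ValueError("a gap has at least one residue")
    return gap_open + gap_extend * gap_length


def linear_gap_cost(gap_length, per_residue=8):
    """Constant penalty per indel position, for comparison (assumed value)."""
    return per_residue * gap_length


for length in (1, 2, 5, 10):
    print(length, affine_gap_cost(length), linear_gap_cost(length))
```

With a high open penalty and a low extension penalty, starting a gap is expensive but lengthening it is cheap, so a few long gaps are preferred over many short ones.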

The allowance of gaps in alignments reduces the information content considerably. The effect is the most profound within short evolutionary distances. For instance, PAM30 with the gap open penalty of nine and the extension penalty of one yields 0.

Local alignment algorithms find local high-scoring segments between sequences, and programs such as SSEARCH and BLAST report only alignments between sequences that share a statistically significant domain. However, the exact length of the alignment depends on the location of the domain and the scoring matrix used.
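How a local alignment algorithm finds the best-scoring segment can be sketched with a minimal Smith-Waterman scoring routine. This version assumes simple match and mismatch scores and a constant per-residue gap penalty instead of a substitution matrix with affine gaps, and it returns only the best score. It is an illustration of the idea, not how SSEARCH or BLAST are implemented.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences `a` and `b`.

    Every cell is floored at zero, so an alignment can start anywhere
    and a poorly scoring prefix is simply dropped.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                            # start a new local alignment
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,            # gap in b
                current[j - 1] + gap,         # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# Hypothetical sequences that share only the short segment "ACGT".
print(smith_waterman_score("GGGACGTGGG", "TTTACGTTTT"))
```

The flanking regions do not lower the score, because the zero floor lets the alignment begin and end at the shared segment.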


