Detecting Natural Selection (Part 3)
Codon Based Models for Detecting Selection
This is the fourth of multiple postings I plan to write about detecting natural selection using molecular data (i.e., DNA sequences). The first post contained a brief introduction and can be found here. The second post described the organization of the genome, and the third described the organization of genes.
As I mentioned in the previous post in this series, we can divide genes into protein coding sequence and non-coding sequence. Protein coding sequences are made up of codons (sets of three nucleotides) which encode amino acids. Because of the redundancy of the genetic code (64 possible codons, but only 20 amino acids), many amino acids are encoded by multiple codons (see the codon table above). Nucleotide substitutions that do not change the amino acid encoded by a codon are said to be synonymous (or silent). Those that do change the amino acid are non-synonymous (some people refer to these as replacement substitutions).
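The distinction between synonymous and non-synonymous changes can be sketched in a few lines of Python. This is only an illustration: it assumes the standard genetic code, and the names CODON_TABLE and classify_substitution are made up for this example.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position
# ("*" marks a stop codon). Assumes the standard (nuclear) code.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(codon_a, codon_b):
    """Label the change from codon_a to codon_b (hypothetical helper)."""
    if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
        return "synonymous"
    return "non-synonymous"


# CTT and CTC both encode leucine; GAA (glutamate) to GTA (valine) does not.
print(classify_substitution("CTT", "CTC"))
print(classify_substitution("GAA", "GTA"))
```

Note that a few first-position changes are also silent (TTA and CTA both encode leucine), which is why redundancy is not purely a third-position property.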
Tests for natural selection often compare one set of nucleotides (or sites) which may be under selection to another set of nucleotides (or sites) that are probably not under selection. The sites that are not under selection are said to be evolving under neutral processes, and they act as a scientific control. The patterns of evolution at the sites that may be under selection (the experimental group) are then compared to the neutral sites. If both sets are evolving similarly, then we fail to reject neutrality at the sites that may be under selection. Conversely, if the two types of sites are evolving at different rates, we reject neutrality.
If we assume that selection acts on the amino acid sequence (the protein encoded by the gene) then the synonymous substitutions should be selectively neutral. This is not necessarily true, and is a legitimate concern in some instances, but we will ignore it here. We can count the number of synonymous and non-synonymous substitutions in a gene if we have a copy from each of two different species (for example, human and mouse). To standardize for differences in the number of synonymous and non-synonymous sites (there are more than twice as many non-synonymous sites in a coding sequence because, for the most part, only the third codon position is redundant), we calculate the number of synonymous differences per synonymous site (kS or dS) and the number of non-synonymous differences per non-synonymous site (kA or dN). The two types of statistics (dS and dN versus kS and kA) differ in how synonymous and non-synonymous sites are calculated. For this discussion, I will be using dS and dN.
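To make the counting concrete, here is a rough sketch of one way to do it, loosely in the spirit of the Nei-Gojobori counting method (my choice for illustration, not the only option). It assumes two aligned coding sequences with no gaps and no stop codons, treats changes that would create a stop codon as non-synonymous (a simplification), and all function names are hypothetical.

```python
from itertools import permutations, product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def synonymous_sites(codon):
    """Expected number of synonymous sites in one codon (between 0 and 3)."""
    s = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            # Each position has three possible changes, so each counts 1/3.
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                s += 1 / 3
    return s


def count_differences(codon_a, codon_b):
    """Return (synonymous, non-synonymous) differences between two codons,
    averaged over the possible orders in which the changes could occur."""
    diff_positions = [i for i in range(3) if codon_a[i] != codon_b[i]]
    syn_total = non_total = 0.0
    n_paths = 0
    for order in permutations(diff_positions):
        current, syn, non, valid = codon_a, 0, 0, True
        for pos in order:
            nxt = current[:pos] + codon_b[pos] + current[pos + 1:]
            if CODON_TABLE[nxt] == "*":  # skip paths through a stop codon
                valid = False
                break
            if CODON_TABLE[nxt] == CODON_TABLE[current]:
                syn += 1
            else:
                non += 1
            current = nxt
        if valid:
            syn_total, non_total, n_paths = syn_total + syn, non_total + non, n_paths + 1
    if n_paths == 0:  # no usable path; contribute nothing
        return 0.0, 0.0
    return syn_total / n_paths, non_total / n_paths


def proportions(seq_a, seq_b):
    """Return (dS, dN) as uncorrected proportions of differing sites."""
    s_sites = n_sites = s_diff = n_diff = 0.0
    for i in range(0, len(seq_a) - 2, 3):
        a, b = seq_a[i:i + 3], seq_b[i:i + 3]
        s = (synonymous_sites(a) + synonymous_sites(b)) / 2
        s_sites += s
        n_sites += 3 - s
        d_syn, d_non = count_differences(a, b)
        s_diff += d_syn
        n_diff += d_non
    # Note: divides by zero if a sequence has no synonymous sites at all.
    return s_diff / s_sites, n_diff / n_sites


dS, dN = proportions("CTTGAAAAA", "CTCGTAAAA")
print(dS, dN)
```

In this toy alignment there is one synonymous difference (CTT to CTC) and one non-synonymous difference (GAA to GTA), spread over 2 synonymous and 7 non-synonymous sites, so dS is 0.5 and dN is about 0.14.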
Once we have calculated the fraction of synonymous (dS) and non-synonymous (dN) sites that differ between the two sequences, we can compare them. Each statistic is a proportion, so the possible values range from zero (no differences) to one (all sites are different). (An aside for anyone interested: Because two random nucleotide sequences are expected to match at 25% of their sites, the observed proportions level off near 0.75 as sequences diverge, even though substitutions keep accumulating. Various corrections have been developed to estimate the actual number of substitutions per site for more diverged sequences.) Assuming that dS is an adequate estimate of the neutral rate of molecular evolution, we interpret the three possible outcomes of a comparison between dN and dS as follows:
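As an example of the kind of correction mentioned in the aside, here is a minimal sketch of the Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), which is one of the simplest such corrections. Choosing it here is my assumption for illustration (other corrections exist), and the function name is hypothetical.

```python
import math


def jukes_cantor(p):
    """Estimate substitutions per site from an observed proportion p of
    differing sites, assuming all substitutions are equally likely."""
    if not 0 <= p < 0.75:
        # At p >= 0.75 the sequences look random and the formula breaks down.
        raise ValueError("p must be in the range [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


# The correction is small for similar sequences and grows with divergence.
for p in (0.01, 0.1, 0.5):
    print(p, jukes_cantor(p))
```

The corrected value is always at least as large as the observed proportion, because some sites will have changed more than once.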
- If dN < dS : Non-synonymous sites are evolving slower than synonymous sites. We interpret this to mean that the non-synonymous sites are under selective constraint (or purifying selection) because they are evolving at a rate slower than the neutral expectation. This is the case for most genes in comparisons between almost any two species, which implies that most mutations that change an amino acid are deleterious and are removed by selection.
- If dN = dS : Non-synonymous and synonymous sites are evolving at equal rates. This is the pattern we expect if non-synonymous substitutions are selectively neutral.
- If dN > dS : Non-synonymous sites are evolving faster than synonymous sites. This is evidence for positive selection because we assume that natural selection is acting on the amino acid sequence of the protein.
The relationship between dN and dS is often summarized by the ratio of the two statistics (dN/dS). If dN<dS then dN/dS<1; if dN=dS then dN/dS=1; if dN>dS then dN/dS>1.
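The three cases can be wrapped in a small helper. This is a sketch only: the tolerance parameter is an arbitrary placeholder I have added so that values very close to one are not over-interpreted, and a real analysis would use a statistical test instead of a fixed cutoff.

```python
def interpret_ratio(dn, ds, tolerance=0.05):
    """Return (dN/dS, label) for a pair of estimates (hypothetical helper)."""
    if ds == 0:
        # No synonymous differences: the ratio cannot be computed.
        return None, "undefined"
    ratio = dn / ds
    if ratio < 1 - tolerance:
        label = "purifying selection"
    elif ratio > 1 + tolerance:
        label = "positive selection"
    else:
        label = "consistent with neutrality"
    return ratio, label


print(interpret_ratio(0.1, 0.5))
print(interpret_ratio(0.3, 0.3))
print(interpret_ratio(0.6, 0.2))
```

The explicit check for dS = 0 matters in practice, since closely related sequences often have no synonymous differences at all.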
There are some obvious limitations (and some more subtle ones) to using dN and dS to identify genes under positive selection. As I mentioned earlier, dS may not be an adequate estimate of neutral evolution. Furthermore, when only one (or very few) amino acid sites are under positive selection, these statistics are not very sensitive to that selection. Because synonymous substitutions generally accumulate at a faster rate than non-synonymous substitutions across the gene as a whole, the one (or few) amino acid substitutions that occur due to positive selection will not raise dN enough for it to be significantly greater than dS. One way to get around this problem is to perform a sliding window analysis (examine subsets of the coding sequence, say 30 codons at a time) to detect regions of a gene that have a signature of positive selection.
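A sliding window analysis can be sketched as a generator that walks along two aligned coding sequences and applies some statistic to each window. Everything here is illustrative: the function names are hypothetical, and the statistic used in the example (the raw fraction of differing nucleotides) is just a stand-in for whatever dN/dS estimator you would actually plug in.

```python
def sliding_windows(seq_a, seq_b, statistic, window=30, step=1):
    """Yield (first_codon_index, statistic(window_a, window_b)).
    window and step are measured in codons, not nucleotides."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    n_codons = len(seq_a) // 3
    for start in range(0, n_codons - window + 1, step):
        lo, hi = 3 * start, 3 * (start + window)
        yield start, statistic(seq_a[lo:hi], seq_b[lo:hi])


def fraction_differing(a, b):
    """Stand-in statistic: proportion of nucleotide positions that differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Toy example: four codons, windows of two codons, one difference in codon 1.
seq_a = "AAAAAAAAAAAA"
seq_b = "AAAAATAAAAAA"
for start, value in sliding_windows(seq_a, seq_b, fraction_differing, window=2):
    print(start, value)
```

Smaller windows localize a signal better but contain fewer substitutions, so the estimates within each window become noisier.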
Despite its limitations, this codon-based approach for detecting selection is widely used and provides the basis for other methods of detecting selection. Next time we will discuss phylogenetics and relative rates.