Detecting Natural Selection (Part 6)
Calculating Nucleotide Sequence Polymorphism
This is the seventh of multiple postings I plan to write about detecting natural selection using molecular data (i.e., DNA sequences). The first post contained a brief introduction and can be found here. The second post described the organization of the genome, and the third described the organization of genes. The fourth post described codon-based models for detecting selection, and the fifth detailed how relative rates can be used to detect changes in selective pressure. The sixth post dealt with classical population methods for detecting selection using allele and genotype frequencies.
Much of the popular press surrounding recent publications that claim to detect natural selection does not adequately detail whether the researchers have identified purifying selection (selective constraint) or positive (aka Darwinian) selection. See here for a particularly poor article that confused the heck out of me (and I claim to understand genome analysis). Recall from our discussion of codon-based models for detecting selection that we can distinguish between these two selection regimes, although the codon-based models are not very powerful. We can also use relative rates to identify differences in selective constraint between lineages, but it is difficult to differentiate relaxed constraint from positive selection using these analyses. Furthermore, a lot of computational biologists are interested in identifying regions that are highly conserved between genomes, under the assumption that these sequences are probably under strong purifying selection.
But what if we’re interested in finding signatures of positive selection? In my opinion, the best data for this type of analysis are DNA sequence polymorphism data. This post will detail some of the core concepts in calculating polymorphism from DNA sequences. Subsequent posts will detail the statistical analyses that can be performed on these data.
Signatures of selection can be detected using either protein coding or non-coding DNA; the statistical techniques differ for analyzing these two types of sequences, but the initial steps are very similar. We will refer to the region of the genome we are sequencing as a locus. The length of this region can range anywhere from about 500 nucleotides to thousands of nucleotides (the longer the region, the more work the grad student or undergrad doing the lab work must put in). Once a researcher has chosen a particular locus (either because she knows sequencing it will be feasible or because it is near a gene of particular interest), she will sequence it in approximately 20-30 individuals from some population (although it is common to sequence as few as 10 or as many as hundreds of individuals). Choosing which individuals to sample (and which populations to sample from) is beyond the scope of this entry and often depends greatly on the natural history of the species in question.
DNA alignment.
Once all of the “wet-lab” work is completed (which can take weeks if you are lucky or years if you chose a poorly studied taxon and tricky locus), the sequences must be aligned (see above). To read the alignment above, you need a quick primer on the nomenclature. The column on the left contains the identifiers (or names) of the sequences. The first set of sequences is the alignment of the first 50 nucleotides from the locus, followed by numbers that indicate the position of each nucleotide. Below that, we have positions 51-100, then 101-150, etc. Ideally, we would have all the positions aligned in a single block, but the limitations of printed paper prevent such a representation. The color coding is just there to make it easier to distinguish the nucleotides from one another.
I will assume we have a good alignment, although the alignment process can be quite tricky. Each homologous nucleotide can be compared between all of the sequences in an alignment. Sites at which all sequences have the same nucleotide (position 1 above, where all sequences have a G) are monomorphic. If there are different nucleotides at a site (position 6, where some sequences have a C and the others have a T), that site is said to be polymorphic. We can count the number of sites that are monomorphic and the number that are polymorphic.
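To make the site-by-site comparison concrete, here is a minimal Python sketch. The four toy sequences and the function name classify_sites are made up for illustration; they are not the sequences in the alignment above. The sketch also assumes every character is an unambiguous nucleotide, so gaps and ambiguity codes would need extra handling in real data.

```python
def classify_sites(alignment):
    """Return (monomorphic, polymorphic) lists of 1-based column positions.

    `alignment` maps a sequence name to its aligned sequence string.
    Assumption: no gaps or ambiguity codes; every character is compared as-is.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")

    monomorphic, polymorphic = [], []
    # zip(*...) walks the alignment one column (site) at a time.
    for position, column in enumerate(zip(*alignment.values()), start=1):
        if len(set(column)) == 1:
            monomorphic.append(position)   # every sequence has the same nucleotide
        else:
            polymorphic.append(position)   # at least two different nucleotides
    return monomorphic, polymorphic


# Hypothetical toy alignment (10 sites, 4 sequences).
toy_alignment = {
    "seq1": "GATTACAGGT",
    "seq2": "GATTATAGGT",
    "seq3": "GATTACAGGT",
    "seq4": "GACTATAGGT",
}

mono, poly = classify_sites(toy_alignment)
print(len(mono), "monomorphic sites;", len(poly), "polymorphic sites at", poly)
```

In this toy example, position 1 is monomorphic (all G) and position 6 is polymorphic (C or T), mirroring the two cases described above.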
We will discuss two types of polymorphism. The first, the number of segregating sites (Sn), is just the number of polymorphic nucleotide sites in the data set. As the number of sequences in the data set increases, so too does the number of segregating sites. If this is not obvious to you, imagine we have sampled five sequences from a population (if you need a picture, imagine it’s the first five sequences listed above: D_yakuba, RPU74073, RPU74053, PSU74068, TJU74075). We can calculate the number of segregating sites using these five sequences, and this number will usually be less (and never be more) than the number we would find if we added five more sequences to our sample. Each time we add a sequence, we may identify more segregating sites, although there are diminishing returns as more sequences get added. Eventually (ok, after a really long time) you identify all of the segregating sites in a population, and adding more sequences will only turn up the same polymorphic sites that you have already identified.
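The effect of sample size on the number of segregating sites can be sketched in a few lines of Python. The six toy sequences and the function name segregating_sites are hypothetical, and the sketch assumes all sequences are aligned, equal in length, and free of gaps. It counts segregating sites for the first two sequences, then the first three, and so on.

```python
def segregating_sites(sequences):
    """Count the columns in which at least two different nucleotides occur.

    Assumption: sequences are aligned, equal in length, and contain no gaps.
    """
    return sum(1 for column in zip(*sequences) if len(set(column)) > 1)


# Hypothetical sample of six aligned sequences.
toy_sequences = [
    "GATTACAGGT",
    "GATTATAGGT",
    "GATTACAGGT",
    "GACTATAGGT",
    "GATTACAGCT",
    "GATTATAGGT",
]

# Number of segregating sites using only the first k sequences, for k = 2..6.
counts = [
    segregating_sites(toy_sequences[:k])
    for k in range(2, len(toy_sequences) + 1)
]
print(counts)
```

Adding a sequence can reveal a new polymorphic site or leave the count unchanged, but it can never turn a polymorphic site back into a monomorphic one, so the list of counts never decreases.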
The number of segregating sites depends on the number of sequences in your data set, and we will need to apply a simple correction to take this into account (I will address this in a later post). The second type of polymorphism, the average pairwise differences (p), does not suffer from this problem. To calculate p we must first compare all pairwise combinations of sequences in the data set (compare the first sequence to the second, the first to the third, the first to the fourth, and so on, all the way to the second-to-last and the last). In each pairwise comparison we count the number of nucleotide sites that differ between those two sequences. Once we have counted the differences for each pair, we add up the counts and divide by the number of comparisons made (n(n-1)/2 for a sample of n sequences) to get the average pairwise differences. Because we are taking an average, this estimate of polymorphism does not systematically increase with the number of sequences in the sample.
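The calculation of p described above can be sketched as follows. This is only an illustration: the toy sequences and function names are made up, and the code assumes aligned, equal-length sequences without gaps. It compares every pair of sequences, counts the sites at which they differ, and divides the total by the number of pairs.

```python
from itertools import combinations


def pairwise_differences(seq_a, seq_b):
    """Count the sites at which two aligned sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def average_pairwise_differences(sequences):
    """Average number of differences over all n(n-1)/2 pairs of sequences.

    Assumption: sequences are aligned, equal in length, and contain no gaps.
    """
    pairs = list(combinations(sequences, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    total = sum(pairwise_differences(a, b) for a, b in pairs)
    return total / len(pairs)


# Hypothetical sample of four aligned sequences (six pairwise comparisons).
toy_sequences = [
    "GATTACAGGT",
    "GATTATAGGT",
    "GATTACAGGT",
    "GACTATAGGT",
]

p = average_pairwise_differences(toy_sequences)
print(round(p, 4))
```

With four sequences there are six comparisons; in this toy sample they contribute 1, 0, 2, 1, 1, and 2 differences, so p is 7/6, or a little more than one difference per pair.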
In future posts we will discuss the theoretical framework behind detecting selection using nucleotide polymorphism and how our two estimates of polymorphism can be used to detect natural selection.