evolgen archive: Species Sampling for Whole Genome Sequencing

Now that we have entered the post-genomics era, with the genomes of most model organisms completely sequenced (as well as the human genome), it is up to genome centers, researchers, and funding agencies to decide which genomes to sequence next. These decisions are influenced by previous research done on the organism, size of the genome (in base pairs, not genes), biology of the species, and its evolutionary relationship to other species with a complete genome sequence. The process differs a bit between bacterial and eukaryotic species (and within eukaryotes, between animals, plants, and fungi -- and within the different animal groups: insects, vertebrates, etc) because of general differences in genome size, biology, and previous knowledge in these taxa, but many concerns are universal.

In the Drosophila community, for example, much consideration went into picking the second genome to sequence (after D. melanogaster, of course). D. pseudoobscura was chosen because it was thought that important non-coding sequences could be identified through comparisons with D. melanogaster. It turns out that the non-coding sequences of these two species are too divergent for homologous regions to be aligned, so conserved non-coding sequences could not be identified. This was not entirely detrimental, as comparisons of gene order between the two species revealed insights into the origin and evolution of genome rearrangements.

Currently, ten more Drosophila genomes are in the process of being sequenced and analyzed for publication. In choosing these genomes, the major players in Drosophila genomics decided to focus on developing annotation and analysis tools, studying tempos of evolution, and examining speciation. The third and fourth species to be sequenced (D. simulans and D. yakuba) are both close relatives of D. melanogaster (at least closer than D. pseudoobscura). Other species were chosen to cover different amounts of divergence from D. melanogaster, from closely related species (D. erecta and D. ananassae) to more divergent species (D. willistoni, D. mojavensis, D. virilis, and D. grimshawi). Each of these species is among the best studied at its level of divergence from D. melanogaster. Furthermore, two other species (D. sechellia and D. persimilis) were chosen for studying the genetics of speciation because of their close relationship with other species whose genomes are being sequenced. The D. pseudoobscura - D. persimilis system is one of the best studied, dating back to work done by Dobzhansky, but also including really cool stuff on behavior, genetics, and ecology getting done today. The whole genome sequences of these two species are already being put to use to study the genetics of speciation.

The species sampling strategy for Drosophila genomes will be tested during the genome analysis stage (currently underway). If it proves to be successful, it should serve as a model for whole genome sequencing in other taxa. But what about other approaches? A paper by Fabio Pardi and Nick Goldman in PLoS Genetics (it’s free, so you have no excuse not to read it) presents a new approach toward species sampling in genomics. They begin with the assumption that, ultimately, sequencing projects would like to maximize the amount of divergence between species sampled in some taxonomic group. I question this assumption, as the Drosophila projects do not follow this strategy (they include an excess of species closely related to D. melanogaster), and the current sequencing in mammals does not either (too many primates). If, however, we exclude primates from the mammalian sequencing projects, it does appear that the sequencing projects are attempting to maximize divergence.

Figure 1.

Phylogenetic Scopes and Divergence of Sets of Species

(A) Phylogenetic scope comprising hypothetical species A, B, C, D, and E. Numbers are branch lengths indicating evolutionary distances (not necessarily reflecting temporal distances). The subtree connecting species B, C, and E is shown in red and has divergence 1 + 3 + 1 + 5 + 2 + 4 = 16. Applying the greedy algorithm always produces maximally divergent extensions of the original set. For example, the subsets constructed starting with B—BE (divergence 11), BCE (16), BCDE (19)—have maximum divergence among those obtainable by adding one, two, and three additional species, respectively. The series AE (12), ACE (17), ACDE (20) is optimal among all possible subsets of two, three, and four species.

(B) Phylogenetic scope comprising placental mammals that have been or are being sequenced (in red) and candidates for future sequencing. If five groups choose the next five targets for sequencing using the greedy strategy described in the text, the following species (in blue) will be selected (in order): (1) tenrec, (2) hedgehog, (3) rock hyrax, (4) tree shrew, (5) dog-faced fruit bat (a megabat). Within the phylogenetic scope shown, this is guaranteed to be the choice of five species that maximises the total resulting divergence. These species have recently been announced amongst targets for future sequencing.

The paper asks, “How can we ensure that divergence is maximized in sampled genomes?” Well, it turns out that if each sequencing project independently chooses the single genome that adds the most divergence to those already sampled, we get the same result as if all the genomes were chosen together to maximize the total divergence among sampled species. Thus, the “greedy” algorithm (each project only looking out for its own self-interest) is just as successful as a more holistic approach. The groups that sequence particular genomes must make their choices known to the rest of the research community, as those choices influence future decisions, but we do not need to plan more than one step ahead. This comes as a surprise because greedy algorithms are not usually the best way to solve a computational problem.
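To make the greedy idea concrete, here is a minimal Python sketch. The tree is my own hypothetical one (the internal nodes X, Y, Z and every branch length are assumptions), picked only so that the divergences quoted in the Figure 1A caption come out; it is not claimed to be the tree drawn in the paper. The names `EDGES`, `divergence`, `greedy_extend`, and `best_subset` are likewise made up for illustration. Divergence is taken to be the total branch length of the subtree connecting the chosen species, as in the caption.

```python
from itertools import combinations

# Hypothetical unrooted tree, written as (node, node, branch length).
# A-E are species; X, Y, Z are internal nodes. Topology and lengths are
# assumptions chosen to reproduce the divergences quoted in the caption.
EDGES = [
    ("A", "X", 2), ("B", "X", 1), ("X", "Y", 3), ("C", "Y", 5),
    ("Y", "Z", 1), ("D", "Z", 3), ("E", "Z", 6),
]
LEAVES = ["A", "B", "C", "D", "E"]

ADJACENT = {}
for u, v, _ in EDGES:
    ADJACENT.setdefault(u, []).append(v)
    ADJACENT.setdefault(v, []).append(u)


def side_of(u, v):
    """Nodes on u's side of the tree once the edge u-v is cut."""
    seen, stack = {u, v}, [u]  # marking v as seen keeps the walk from crossing the edge
    while stack:
        for nxt in ADJACENT[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen - {v}


def divergence(species):
    """Total branch length of the subtree connecting the given species."""
    chosen = set(species)
    total = 0
    for u, v, length in EDGES:
        near = chosen & side_of(u, v)
        # An edge belongs to the subtree only if chosen species lie on both sides.
        if near and chosen - near:
            total += length
    return total


def greedy_extend(start, k):
    """Add k species one at a time, each time taking the biggest gain."""
    selected = list(start)
    for _ in range(k):
        candidates = [s for s in LEAVES if s not in selected]
        selected.append(max(candidates, key=lambda s: divergence(selected + [s])))
    return selected


def best_subset(size, must_include=()):
    """Exhaustive search: the most divergent subset of the given size."""
    options = [c for c in combinations(LEAVES, size) if set(must_include) <= set(c)]
    return max(options, key=divergence)


if __name__ == "__main__":
    for k in (1, 2, 3):
        picked = greedy_extend(["B"], k)
        planned = best_subset(k + 1, must_include=["B"])
        print(picked, divergence(picked), "vs exhaustive", divergence(planned))
```

On this toy tree, starting from B the greedy picks are E, then C, then D, giving divergences 11, 16, and 19, which match the exhaustive search at every step. That is only a check on one small example, not a demonstration of the general result; the proof is in the paper.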

The greedy algorithm can also be applied to projects such as the Drosophila genomes, in which maximizing divergence was not the ultimate goal. In cases where we would like to sequence genomes with different amounts of divergence from some model organism (e.g., D. melanogaster), we must employ an incremental approach. This approach allows us to identify sequence conservation for different amounts of divergence. Important but rapidly evolving sequences can be identified using closely related species, whereas more conserved sequences can be distinguished using more divergent species. Non-coding sequences, for example, evolve faster than protein-coding sequences, so conserved non-coding sequences can only be identified using close relatives. Amino acid (or protein) sequences evolve more slowly and would not differ much between close relatives, so we need more distant species to study them.
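Here is one way the incremental approach could be sketched in Python. This is my own reading of the idea, not the procedure from the paper: for each target level of divergence from the reference species, take the still-unsampled species whose distance is closest to that target. The species labels, the distances, and the function name `incremental_sample` are all hypothetical placeholders.

```python
# Hypothetical distances from a reference species, in arbitrary units.
# These numbers are placeholders, not measurements.
DISTANCE_FROM_REFERENCE = {
    "sp1": 0.05, "sp2": 0.12, "sp3": 0.40,
    "sp4": 0.45, "sp5": 1.10, "sp6": 1.30,
}


def incremental_sample(distances, targets):
    """For each target divergence, pick the unused species closest to it."""
    remaining = dict(distances)
    picks = []
    for target in sorted(targets):
        if not remaining:
            break  # more targets than candidate species
        best = min(remaining, key=lambda s: abs(remaining[s] - target))
        picks.append(best)
        del remaining[best]  # each species can be sampled only once
    return picks


if __name__ == "__main__":
    # One close, one intermediate, and one distant relative of the reference.
    print(incremental_sample(DISTANCE_FROM_REFERENCE, [0.1, 0.5, 1.0]))
```

Like the greedy strategy, this makes one choice at a time without planning ahead; the difference is the criterion used at each step (closeness to a target divergence rather than the largest gain in total divergence).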

Despite the formal proof that the greedy algorithm works for different sequencing strategies, I doubt we will see this type of selfish behavior from sequencing centers. If they do choose to sequence a genome for self-serving purposes, it will most likely be because a particular organism interests them, and not to maximize divergence or pick the next most divergent species. Altruistic results will probably come from consciously cooperative behavior.

Pardi F, Goldman N. 2005. Species choice for comparative genomics: being greedy works. PLoS Genet. 1: e71

Richards S, Liu Y, Bettencourt BR, Hradecky P, Letovsky S, et al. 2005. Comparative genome sequencing of Drosophila pseudoobscura: Chromosomal, gene, and cis-element evolution. Genome Res. 15: 1-18

3 Comments:

At 4:15 PM, Amit said...: There is a meeting in Tucson on Genomics of Closely Related Organisms. I wish I was going, if for no other reason than that January in Arizona is bliss (compared to a lot of the U.S. at this time of year).

At 9:26 AM, Anonymous said...: Hi. I'm one of the authors of the paper on species sequencing strategies. Thanks for taking an interest.

We hoped to be a little provocative deliberately. I agree, you are right to question the idea that we should maximize the amount of divergence between species sampled. One of your reasons, though, I disagree with: you suggest that the idea is wrong because it is not what was done for Drosophila or for primates. I think you are putting too much faith in the people making those decisions! Just because they are doing something does not mean it is right.

However, your other points are good ones. For instance, obviously humans and flies (for example) are simply too different for us to learn much about genes in the brain (or the wings!) from comparative studies. And the issue of simply finding homologous sequences and aligning them may become important over the evolutionary distances between species that do share common biology. We do mention these ideas in our paper.

The approach taken with Drosophila feels like a sensible one intuitively (although intuition can be wrong, e.g. greedy algorithms can be optimal!). It will indeed be interesting to see how well future studies work out, taking different approaches. In the end, I feel we are bound to find that what we should sequence depends in every case on quite what we are trying to find out, and no automated procedure for choosing species will ever give the right answer in all cases.

At 2:37 PM, RPM said...: Thanks for commenting, Nick. I think there are times when maximizing divergence is a good idea (for example, when there is no model/well-studied species in a particular taxon). If we're only interested in learning about mammalian evolution (or reconstructing ancestral genomes), then it makes sense to maximize divergence.

If you want to learn about a particular genome (human, D. melanogaster, or any other model/well-studied species), then it does not make sense to maximize divergence within a particular taxon -- in these situations, I would use what I referred to as the "incremental approach". This allows you to identify different types of conserved sequences, as well as locate regions with signatures of positive selection.

You guys suggest, however, that the greedy algorithm works for either scenario, and only the parameters for choosing species (either maximum divergence, or the next most divergent clade) would differ. Am I correct about this?

Friday, December 02, 2005
