Species Sampling for Whole Genome Sequencing
Now that we have entered the post-genomics era, with the genomes of most model organisms (as well as the human genome) completely sequenced, it is up to genome centers, researchers, and funding agencies to decide which genomes to sequence next. These decisions are influenced by previous research done on the organism, the size of its genome (in base pairs, not genes), the biology of the species, and its evolutionary relationship to other species with a complete genome sequence. The process differs a bit between bacterial and eukaryotic species (and within eukaryotes, between animals, plants, and fungi, and within animals, between groups such as insects and vertebrates) because of general differences in genome size, biology, and previous knowledge in these taxa, but many concerns are universal.
In the Drosophila community, for example, much consideration went into picking the second genome to sequence (after D. melanogaster, of course). D. pseudoobscura was chosen because it was thought that important non-coding sequences could be identified through comparisons with D. melanogaster. It turns out that the non-coding sequences of these two species are too divergent for homologous regions to be aligned, so conserved non-coding sequences could not be identified. The effort was not wasted, however, as comparisons of gene order between the two species revealed insights into the origin and evolution of genome rearrangements.
Currently, ten more Drosophila genomes are in the process of being sequenced and analyzed for publication. In choosing these genomes, the major players in Drosophila genomics decided to focus on developing annotation and analysis tools, studying tempos of evolution, and examining speciation. The third and fourth species to be sequenced (D. simulans and D. yakuba) are both close relatives of D. melanogaster (at least closer than D. pseudoobscura). Other species were chosen to span different amounts of divergence from D. melanogaster, from closely related species (D. erecta and D. ananassae) to more divergent species (D. willistoni, D. mojavensis, D. virilis, and D. grimshawi). Each of these species is among the best studied at its particular level of divergence from D. melanogaster. Furthermore, two other species (D. sechellia and D. persimilis) were chosen for studying the genetics of speciation because of their close relationship with other species whose genomes are being sequenced. The D. pseudoobscura - D. persimilis system is one of the best studied, dating back to work done by Dobzhansky, but also including really cool stuff on behavior, genetics, and ecology getting done today. The whole genome sequences of these two species are already being put to use to study the genetics of speciation.
The species sampling strategy for Drosophila genomes will be tested during the genome analysis stage (currently underway). If it proves successful, it should serve as a model for whole genome sequencing in other taxa. But what about other approaches? A paper by Fabio Pardi and Nick Goldman in PLoS Genetics (it’s free, so you have no excuse not to read it) presents a new approach toward species sampling in genomics. They begin with the assumption that, ultimately, sequencing projects would like to maximize the amount of divergence between the species sampled in some taxonomic group. I question this assumption, as the Drosophila projects do not follow this strategy (they include an excess of species closely related to D. melanogaster), and the current sequencing in mammals does not either (too many primates). If, however, we exclude primates from the mammalian sequencing projects, it does appear that the sequencing projects are attempting to maximize divergence.
Phylogenetic Scopes and Divergence of Sets of Species
(A) Phylogenetic scope comprising hypothetical species A, B, C, D, and E. Numbers are branch lengths indicating evolutionary distances (not necessarily reflecting temporal distances). The subtree connecting species B, C, and E is shown in red and has divergence 1 + 3 + 1 + 5 + 2 + 4 = 16. Applying the greedy algorithm always produces maximally divergent extensions of the original set. For example, the subsets constructed starting with B—BE (divergence 11), BCE (16), BCDE (19)—have maximum divergence among those obtainable by adding one, two, and three additional species, respectively. The series AE (12), ACE (17), ACDE (20) is optimal among all possible subsets of two, three, and four species.
(B) Phylogenetic scope comprising placental mammals that have been or are being sequenced (in red) and candidates for future sequencing. If five groups choose the next five targets for sequencing using the greedy strategy described in the text, the following species (in blue) will be selected (in order): (1) tenrec, (2) hedgehog, (3) rock hyrax, (4) tree shrew, (5) dog-faced fruit bat (a megabat). Within the phylogenetic scope shown, this is guaranteed to be the choice of five species that maximises the total resulting divergence. These species have recently been announced amongst targets for future sequencing.
The paper asks, “How can we ensure that divergence is maximized in sampled genomes?” Well, it turns out that if each sequencing project independently chose the single genome that adds the most divergence to the set already sampled, we would get the same result as if all the genomes were chosen together to maximize the total divergence between sampled species. Thus, the “greedy” algorithm (each project only looking out for its own self-interest) is just as successful as a more holistic approach. The groups that sequence particular genomes must make their choices known to the rest of the research community, as those choices influence future decisions, but we do not need to plan more than one step ahead. This comes as a surprise because greedy algorithms are not usually guaranteed to find the best solution to a computational problem.
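To make the greedy idea concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption rather than anything taken from the paper: the five-species tree (V, W, X, Y, Z), its branch lengths, and the function names divergence, greedy_extend, and best_extension are all made up for this example. Divergence is computed the way the figure caption describes it, as the total branch length of the subtree connecting the chosen species.

```python
from itertools import combinations

# Hypothetical rooted tree, NOT the tree from the paper's figure.
# Each entry maps a node to (parent, length of the branch above it).
TREE = {
    "V": ("n1", 2.0),
    "W": ("n1", 1.0),
    "n1": ("root", 3.0),
    "X": ("n2", 5.0),
    "Y": ("n2", 3.0),
    "n2": ("root", 1.0),
    "Z": ("root", 4.0),
}
SPECIES = ["V", "W", "X", "Y", "Z"]


def divergence(chosen, tree=TREE):
    """Total branch length of the subtree connecting the chosen species."""
    chosen = set(chosen)
    # Count how many chosen species sit below each branch.
    below = dict.fromkeys(tree, 0)
    for leaf in chosen:
        node = leaf
        while node in tree:  # walk up until we reach the root
            below[node] += 1
            node = tree[node][0]
    # A branch belongs to the connecting subtree only if it separates
    # some chosen species from the other chosen species.
    return sum(
        length
        for node, (_, length) in tree.items()
        if 0 < below[node] < len(chosen)
    )


def greedy_extend(start, n_new, tree=TREE, species=SPECIES):
    """Add n_new species one at a time, each time taking the best single pick."""
    chosen = list(start)
    for _ in range(n_new):
        candidates = [s for s in species if s not in chosen]
        best = max(candidates, key=lambda s: divergence(chosen + [s], tree))
        chosen.append(best)
    return chosen


def best_extension(start, n_new, tree=TREE, species=SPECIES):
    """Exhaustive search over every way of adding n_new species (for comparison)."""
    rest = [s for s in species if s not in start]
    return max(
        (list(start) + list(extra) for extra in combinations(rest, n_new)),
        key=lambda subset: divergence(subset, tree),
    )


if __name__ == "__main__":
    picked = greedy_extend(["W"], 2)
    print(picked, divergence(picked))  # ['W', 'X', 'Z'] 14.0
    print(divergence(best_extension(["W"], 2)))  # 14.0
```

On this toy tree, the one-step-at-a-time picks reach the same total divergence as the exhaustive search for every starting species and every number of additions. The start argument can hold any set of species that has already been sequenced, which is the situation a sequencing center would actually face.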
The greedy algorithm can also be applied to projects such as the Drosophila genomes, in which maximizing divergence was not the ultimate goal. In cases where we would like to sequence genomes with different amounts of divergence from some model organism (e.g., D. melanogaster), we must employ an incremental approach. This approach allows us to identify sequence conservation at different amounts of divergence. Important but rapidly evolving sequences can be identified using closely related species, whereas more conserved sequences can be distinguished using more divergent species. Non-coding sequences, for example, evolve faster than protein-coding sequences, so conserved non-coding sequences can only be identified using close relatives. Amino acid (or protein) sequences evolve more slowly and would not differ much between close relatives, so we need more distant species to study them.
Despite the formal proof that the greedy algorithm works for different sequencing strategies, I doubt we will see this type of selfish behavior from sequencing centers. If they do choose to sequence a genome for self-serving purposes, it will most likely be because a particular organism interests them, and not to maximize divergence or pick the next most divergent species. Altruistic results will probably come from consciously cooperative behavior.
Pardi F, Goldman N. 2005. Species choice for comparative genomics: being greedy works. PLoS Genet. 1: e71
Richards, S, Liu, Y,