Molecular Signatures of Natural Selection
Rasmus Nielsen
Center for Bioinformatics and Department of Evolutionary Biology, University of Copenhagen, 2100 Copenhagen Ø, Denmark; email: rasmus@binf.ku.dk
Annu. Rev. Genet. 2005. 39:197–218
First published online as a Review in Advance on August 31, 2005
The Annual Review of Genetics is online at http://genet.annualreviews.org
doi: 10.1146/annurev.genet.39.073003.112420
Copyright © 2005 by Annual Reviews. All rights reserved
0066-4197/05/1215-0197$20.00
Key Words
Darwinian selection, neutrality tests, genome scans, positive selection, phylogenetic footprinting
Abstract
There is an increasing interest in detecting genes, or genomic regions, that have been targeted by natural selection. The interest stems from a basic desire to learn more about evolutionary processes in humans and other organisms, and from the realization that inferences regarding selection may provide important functional information. This review provides a nonmathematical description of the issues involved in detecting selection from DNA sequences and SNP data and is intended for readers who are not familiar with population genetic theory. Particular attention is placed on issues relating to the analysis of large-scale genomic data sets.
Annu. Rev. Genet. 2005.39:197-218. Downloaded from www.annualreviews.org by University of New Mexico on 08/16/11. For personal use only.
Contents
INTRODUCTION
    The Nomenclature of Selection Models
POPULATION GENETIC PREDICTIONS
POPULATION GENETIC SIGNATURES OF SELECTION
    Population Differentiation
    The Frequency Spectrum
    Models of Selective Sweeps
    LD and Haplotype Structure
    McDonald-Kreitman Tests
STATISTICAL CONCERNS
SIGNATURES OF SELECTION IN COMPARATIVE DATA
    Targets of Positive Selection
GENOMIC APPROACHES
    PRF Models
    SNP Data
    Comparative Genomic Data
FUNCTIONAL INFERENCES
    Phylogenetic Footprinting
    Disease Genetics
    Positive Selection
EVIDENCE FOR SELECTION
SNP: single nucleotide polymorphism
INTRODUCTION
Population geneticists have for decades been occupied with the problem of quantifying the relative contribution of natural selection in shaping the genetic variation observed among living organisms. In one school of thought, known as the neutral theory, most of the variation within and between species is selectively neutral, i.e., it does not affect the fitness of the organisms (58, 59). New mutations that arise may increase in frequency in the population due to random factors, even though they do not provide a fitness advantage to the organisms carrying them. The process by which allele frequencies change in populations due to random factors is known as genetic drift.
A second school of thought maintains that a large proportion of the variation observed does affect the fitness of the organisms and is subject to Darwinian selection (39). These issues have not been settled with the availability of large-scale genomic data, but the debate has shifted from a focus on general laws or patterns of molecular evolution to the description of particular instances where natural selection has shaped the pattern of variation. This type of analysis is increasingly being done because it has become apparent that inferences regarding the patterns and distribution of selection in genes and genomes may provide important functional information. For example, in the human genome, the areas where disease genes are segregating should be under selection (assuming that the disease phenotype leads to a reduction in fitness). Even very small fitness effects may, on an evolutionary time scale, leave a very strong pattern. Therefore, in theory it may be possible to identify putative genetic disease factors by identifying regions of the human genome that currently are under selection (7). In general, positions in the genome that are under selection must be of functional importance. Inferences regarding selection have therefore been used extensively to identify functional regions or protein residues (12, 91). The purpose of this paper is to review the current knowledge regarding the effect of selection on a genome and to discuss methods for detecting selection using molecular data, especially genomic DNA sequence and single nucleotide polymorphism (SNP) data.
The Nomenclature of Selection Models
Much confusion exists in the literature regarding how various types of selection are defined, in particular because some of the terminology is used slightly differently within different scientific communities. At the risk of contributing further to this confusion, I propose here some simple definitions for some of the common terms used in the discussion of selection models before moving on to the main topics of this review.
The basic population genetic terms are well-defined. The classical population genetic models that students of biology will first encounter are models with two alleles, typically denoted $A$ and $a$. Selection then occurs if the fitnesses of the three possible genotypes ($w_{AA}$, $w_{Aa}$, and $w_{aa}$) are not all equal. There is directional selection if the fitnesses of the three genotypes are not all equal and if $w_{AA} > w_{Aa} > w_{aa}$ or $w_{AA} < w_{Aa} < w_{aa}$. Directional selection tends to eliminate variation within populations and either increase or decrease variation between species depending on whether $A$ or $a$ is the new mutant. Overdominance occurs if the heterozygote has the highest fitness, i.e., if $w_{AA} < w_{Aa} > w_{aa}$. Overdominance is a case of balancing selection where variability is maintained in the population due to selection. In haploid organisms, selection occurs if $w_A \neq w_a$, and overdominance is not possible. The difference in fitness between alleles is the selection coefficient, i.e., for the haploid model the selection coefficient could be defined as $s_A = w_A - w_a$.
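The contrast between directional selection and overdominance can be illustrated with the standard deterministic recursion for the frequency $p$ of allele $A$ under random mating, $p' = (p^2 w_{AA} + p q w_{Aa}) / \bar{w}$ with $q = 1 - p$ and $\bar{w} = p^2 w_{AA} + 2pq\,w_{Aa} + q^2 w_{aa}$. The sketch below is only an illustration: it ignores genetic drift and mutation, and the fitness values and function names are assumptions chosen for the example, not values from the text.

```python
def next_freq(p, w_AA, w_Aa, w_aa):
    """Frequency of allele A after one generation of selection
    (random mating, infinite population, no mutation)."""
    q = 1.0 - p
    mean_w = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    return (p * p * w_AA + p * q * w_Aa) / mean_w


def iterate(p0, w_AA, w_Aa, w_aa, generations):
    """Apply the recursion repeatedly, starting from frequency p0."""
    p = p0
    for _ in range(generations):
        p = next_freq(p, w_AA, w_Aa, w_aa)
    return p


# Hypothetical fitness values, chosen only to show the two regimes.
directional = iterate(0.1, 1.10, 1.05, 1.00, 2000)   # w_AA > w_Aa > w_aa
overdominant = iterate(0.1, 0.90, 1.00, 0.80, 2000)  # w_AA < w_Aa > w_aa

print(f"directional selection: p = {directional:.4f}")
print(f"overdominance:         p = {overdominant:.4f}")
```

Under these assumed values the directional case drives $A$ towards fixation, removing variation, whereas the overdominant case settles at an intermediate frequency, $(w_{Aa} - w_{aa}) / (2w_{Aa} - w_{AA} - w_{aa})$, so that both alleles are maintained.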
In the molecular evolution literature, it has been common to use the terminology of positive selection, negative selection, purifying selection, and diversifying selection. Here we define negative selection as any type of selection where new mutations are selected against. Likewise, we define positive selection as any type of selection where new mutations are advantageous (have positive selection coefficients). In the context of the simple two-allele models, both directional selection and overdominance can be cases of positive selection. Purifying selection is identical to negative selection in that it describes selection against new variants. Diversifying selection has in the population genetics literature been synonymous with disruptive selection, a type of selection where two or more extreme phenotypic values are favoured simultaneously. This type of selection will often increase variability, and diversifying selection has, therefore, in the molecular evolution literature recently been used more generically to describe any type of selection that increases variability. However, as disruptive selection may reduce genetic variability when one of the extreme types becomes fixed in the population, and since there are many other forms of selection that can increase levels of genetic variability, the more generic use of the term “diversifying selection” should probably be avoided.
When a new mutant does not affect the fitness of the individual in which it arises (i.e., $w_{AA} = w_{Aa} = w_{aa}$), it is said to be neutral. In general, neutrality describes the condition where the loci under consideration are not affected by selection. A statistical method aimed at rejecting a model of neutral evolution is called a neutrality test.
POPULATION GENETIC PREDICTIONS
One of the main interests in molecular population genetics is to distinguish molecular variation that is neutral (only affected by random genetic drift) from variation that is subject to selection, particularly positive selection. An important point is that neutral models usually allow for the presence of strongly deleterious mutations that have such strong negative fitness consequences that they are immediately eliminated from the population (58). If selection only involves such mutations of very strong effect, the only mutations that will actually segregate in the population are the neutral mutations. Therefore, neutral models include the possible existence of pervasive strong negative selection. Although negative or purifying selection may be of great interest because it may help detect regions or residues of functional importance, much interest in the evolutionary literature focuses on positive selection because it is associated with adaptation and the evolution of new form or function. One of the main points of contention in population genetics has been the degree to which positive selection is important in explaining the pattern of variability within and between species (39, 59).
Balancing selection: selection that increases variability within a population
Positive selection: selection acting upon new advantageous mutations
Negative selection: selection acting upon new deleterious mutations
Neutrality test: a statistical test of a model which assumes all mutations are either neutral or strongly deleterious
Neutral mutation: a mutation that does not affect the fitness of individuals who carry it in either heterozygous or homozygous condition
Selective sweep: the process by which a new advantageous mutation eliminates or reduces variation in linked neutral sites as it increases in frequency in the population
Much of the theoretical literature in population genetics over the past 50 years has focused on developing and analyzing models that generalize the previously mentioned basic di-allelic models to models where more than two alleles may be segregating, where multiple mutations may arise and interact (possibly in the presence of recombination), where the environment may be changing through time, and where random genetic drift may be acting in populations subject to various demographic forces (25, 39). From theory alone we have gained many valuable insights, including the fact that the efficacy of selection depends not only on the selection coefficient, but primarily on the product of the selection coefficient and the effective population size. An increased effect of selection may be due to either an increased population size or a larger selection coefficient. Among other important findings is that balancing selection may occur for many reasons other than overdominance (e.g., fluctuating environmental conditions) and could therefore, potentially, be quite common (38, 39). However, the efficacy of selection will tend to be reduced when multiple selected alleles are segregating simultaneously in the genome. The mutations will tend to interfere with each other and reduce the local effective population size (8, 29, 40, 57). Many population geneticists used to believe that the number of selective deaths required to maintain large amounts of selection would have to be so large that selection would probably play a very small role in shaping genetic variation (43, 60, 61). These types of arguments, known as genetic load arguments, were instrumental in the development of the neutral theory. However, the amount of selection that a genome can permit depends on the way mutations interact in their effect on organismal fitness and on several other critical model assumptions (25, 62, 71, 107). Population genetic theory does not exclude the possibility that selection is very pervasive and cannot alone determine the relative importance and modality of selection in the absence of data from real living organisms (25, 39).
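The dependence of the efficacy of selection on the product of the selection coefficient and the population size can be made concrete with Kimura's classical diffusion approximation for the fixation probability of a single new mutation, $u = (1 - e^{-2s}) / (1 - e^{-4Ns})$. This formula is not given in the text; it is brought in here as a standard result, and the sketch assumes a diploid population in which the effective size equals the census size $N$ and selection is additive (heterozygote advantage $s$). The function name and parameter values are hypothetical.

```python
import math


def fixation_probability(N, s):
    """Diffusion approximation to the fixation probability of one new
    mutant copy in a diploid population of size N (assumed equal to the
    effective size), with additive selection coefficient s."""
    if s == 0.0:
        return 1.0 / (2 * N)  # neutral case: the initial frequency
    # expm1(x) = exp(x) - 1, which is numerically stable for small x
    return math.expm1(-2.0 * s) / math.expm1(-4.0 * N * s)


# Two populations with very different N and s but the same product Ns = 1.
for N, s in [(1_000, 0.001), (10_000, 0.0001)]:
    relative = 2 * N * fixation_probability(N, s)  # relative to neutrality
    print(f"N = {N:6d}, s = {s:.4f}: {relative:.3f} times the neutral value")
```

Relative to a neutral mutation, the two assumed parameter sets give almost the same fixation probability, which is the sense in which the product $Ns$, rather than $s$ alone, governs the outcome.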
Much excitement currently exists in the population genetics communities over the fact that many predictions generated from the theory may now be tested in the context of the large genomic data sets. In particular, we should be able to detect the molecular signatures of new, strongly selected advantageous mutations that have recently become fixed (reached a frequency of one in the population). As these mutations increase in frequency, they tend to reduce variation in the neighboring region where neutral variants are segregating (13, 51, 52, 68). This process, by which a selected mutation reduces variability in linked sites as it goes to fixation, is known as a selective sweep (Figure 1). The hope is that by analysis of large comparative genomic data sets and large SNP data sets we will be able to determine how and where both positive and negative selection has affected variation in humans and other organisms.
Figure 1
The effect of a selective sweep on genetic variation. The figure is based on averaging over 100 simulations of a strong selective sweep. It illustrates how the number of variable sites (variability) is reduced, LD is increased, and the frequency spectrum, as measured by Tajima’s D, is skewed, in the region around the selective sweep. All statistics are calculated in a sliding window along the sequence right after the advantageous allele has reached frequency 1 in the population. All statistics are also scaled so that the expected value under neutrality equals one.
POPULATION GENETIC SIGNATURES OF SELECTION
One of the main effects of selection is to modify the levels of variability within and between species (Table 1). A selective sweep tends to drastically reduce variation within a population, but will not lead to a reduction in species-specific differences. Conversely, negative selection acting on multiple loci will tend to reduce variability between species more drastically than variability within species. Table 1 summarizes how various types of selection affect variability. Note that changes in the mutation rate alone will have the same effect on interspecific (between-species) and intraspecific (within-species) variability. However, selection affects intraspecific and interspecific variability differently. Many of the common population genetic methods for detecting selection are therefore based on comparing variation within and between species, most famously the HKA test (48). In this test, the ratio of polymorphism to divergence is compared for multiple genes. If the ratio varies more among genes than expected under a neutral model, neutrality is rejected.
Population Differentiation
Selection may in many cases increase the degree of differentiation among populations. In particular, recent theory shows that a selective sweep can have a dramatic impact on the level of population subdivision, particularly when the sweep has not yet spread to all populations within a species (20, 65, 97). When a locus shows extraordinary levels of genetic population differentiation, compared with other loci, this may then be interpreted as evidence for positive selection.
One of the first neutrality tests proposed, the Lewontin-Krakauer (63) test, takes advantage of this fact. This test rejects the neutral model for a locus if the level of genetic differentiation among populations is larger than predicted by a specific neutral model. It has recently been resurrected in various forms (1, 9, 10, 53, 92, 114), primarily driven by the availability of large-scale genomic data. For example, Akey et al. (1) looked at variation in $F_{ST}$ (the most common measure of population differentiation) among human populations genome-wide. Beaumont & Balding (9) developed a sophisticated statistical method for identifying loci that may be outliers in terms of levels of population subdivision.

Table 1 The effect of selection and mutation on variability within and between species

| Evolutionary factor | Intraspecific variability (a) | Interspecific variability | Ratio of interspecific to intraspecific variability | Frequency spectrum |
| --- | --- | --- | --- | --- |
| Increased mutation rate | Increases | Increases | No effect | No effect |
| Negative directional selection | Reduced | Reduced | Reduced if selection is not too strong | Increases the proportion of low frequency variants |
| Positive directional selection | May increase or decrease | Increased | Increased | Increases the proportion of high frequency variants |
| Balancing selection | Increases | May increase or decrease | Reduced | Increases the proportion of intermediate frequency variants |
| Selective sweep (linked neutral sites) | Decreased | No effect on mean rate of substitution, but the variance increases | Increased | Mostly increases the proportion of low frequency variants |

(a) Note that selection also affects other features of the data not mentioned here, such as levels of LD, haplotype structure, and levels of population subdivision.

Frequency spectrum: the allelic sample distribution in independent nucleotide sites
LD: linkage disequilibrium
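To make the notion of $F_{ST}$ as a measure of population differentiation concrete, the sketch below uses the textbook definition for a biallelic locus, $F_{ST} = (H_T - H_S)/H_T$, where $H_S$ is the mean expected heterozygosity within populations and $H_T$ the expected heterozygosity of the pooled population. This is a minimal illustration that weights all populations equally and ignores sampling error; it is not the estimator used in the studies cited above, and the locus names and allele frequencies are hypothetical.

```python
def fst(allele_freqs):
    """Wright's F_ST for one biallelic locus, given the frequency of one
    allele in each population (populations weighted equally)."""
    k = len(allele_freqs)
    h_s = sum(2.0 * p * (1.0 - p) for p in allele_freqs) / k  # within populations
    p_bar = sum(allele_freqs) / k
    h_t = 2.0 * p_bar * (1.0 - p_bar)                          # pooled population
    if h_t == 0.0:
        return float("nan")  # locus is monomorphic overall; F_ST is undefined
    return (h_t - h_s) / h_t


# Hypothetical allele frequencies in two populations at three loci.
loci = {"locus_1": [0.50, 0.50], "locus_2": [0.40, 0.55], "locus_3": [0.20, 0.80]}
for name, freqs in sorted(loci.items(), key=lambda kv: fst(kv[1]), reverse=True):
    print(f"{name}: F_ST = {fst(freqs):.3f}")
```

In an outlier scan of the kind described above, loci at the top of such a ranking would be the candidates, subject to the caveat that an appropriate neutral model is needed to decide how extreme a value must be.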
The Frequency Spectrum
Selection also affects the distribution of alleles within populations. For DNA sequence or SNP data, some of the most commonly applied tests are based on summarizing information regarding the so-called frequency spectrum. The frequency spectrum is a count of the number of mutations that occur at a frequency of $x_i = i/n$ for $i = 1, 2, \ldots, n-1$, in a sample of size $n$. In other words, it represents a summary of the allele frequencies of the various mutations in the sample. In a standard neutral model (i.e., a model with random mating, constant population size, no population subdivision, etc.), the expected number of mutations at frequency $x_i$ is proportional to $1/i$. Selection against deleterious mutations will increase the fraction of mutations segregating at low frequencies in the sample. A selective sweep has roughly the same effect on the frequency spectrum (13). Conversely, positive selection will tend to increase the fraction of mutations in a sample segregating at high frequencies. The effect of selection on the frequency spectrum is summarized in Figure 2.
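The definition of the frequency spectrum and its neutral expectation can be sketched in a few lines. The example assumes that haplotypes are given as equal-length strings in which `1` marks the derived (mutant) allele and `0` the ancestral allele; this encoding, the function names, and the toy data are assumptions made for illustration.

```python
def site_frequency_spectrum(haplotypes):
    """Return a list whose entry i-1 is the number of sites at which the
    derived allele ('1') is carried by exactly i of the n haplotypes."""
    n = len(haplotypes)
    spectrum = [0] * (n - 1)
    for site in zip(*haplotypes):          # iterate over alignment columns
        i = sum(allele == "1" for allele in site)
        if 0 < i < n:                      # skip sites that are not variable
            spectrum[i - 1] += 1
    return spectrum


def neutral_expectation(n, num_segregating):
    """Expected spectrum under the standard neutral model: proportional to
    1/i, scaled so that the entries sum to the observed number of sites."""
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [num_segregating * w / total for w in weights]


# Hypothetical sample of n = 4 haplotypes with 5 variable sites.
sample = ["10010", "10100", "01000", "00001"]
observed = site_frequency_spectrum(sample)
expected = neutral_expectation(len(sample), sum(observed))
for i, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"i = {i}: observed {obs}, neutral expectation {exp:.2f}")
```

An excess of low-frequency classes relative to the $1/i$ expectation, as in this toy sample, is the kind of skew that negative selection or a recent sweep would produce, although with so few sites no conclusion could be drawn in practice.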
Many of the classic neutrality tests, therefore, focus on capturing information regarding the frequency spectrum. The most famous example is the Tajima’s D test (112). In this test, the average number of nucleotide differences between pairs of sequences is compared with the total number of segregating sites (SNPs). If the difference between these two measures of variability is larger than what is expected under the standard neutral model, this model is rejected. The effect of a selective sweep on Tajima’s D is shown in Figure 1. Fu & Li (34) extended this test to take information regarding the polarity of the mutations into account by the use of an evolutionary outgroup (e.g., a chimpanzee in the analysis of human genetic variation), and more refinements were introduced by Fu (32, 33). Fay & Wu (28) suggested a test that gives more weight to information from high-frequency derived mutations. These tests are probably the most commonly applied neutrality tests to date.
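The comparison underlying Tajima's D can be written out explicitly. The sketch below follows the usual published form of the statistic, $D = (\pi - S/a_1)/\sqrt{e_1 S + e_2 S(S-1)}$, where $\pi$ is the average number of pairwise differences, $S$ the number of segregating sites, and $a_1$, $e_1$, $e_2$ are constants that depend only on the sample size $n$. The input format (one allele count per segregating site) and the example data are assumptions, and significance would still have to be assessed against a neutral model.

```python
import math


def tajimas_d(allele_counts, n):
    """Tajima's D from the number of copies of one allele at each
    segregating site (each count between 1 and n-1), sample size n."""
    if n < 4:
        raise ValueError("the variance term vanishes for n < 4")
    S = len(allele_counts)
    if S == 0:
        return math.nan  # no segregating sites: D is undefined
    # Average number of pairwise differences, summed over sites.
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in allele_counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))


# Hypothetical samples of n = 10 sequences with 10 segregating sites each.
d_rare = tajimas_d([1] * 10, 10)    # every variant is a singleton
d_common = tajimas_d([5] * 10, 10)  # every variant is at frequency 0.5
print(f"all singletons:         D = {d_rare:.2f}")
print(f"all intermediate sites: D = {d_common:.2f}")
```

An excess of rare variants gives a negative D, the direction expected after a selective sweep or under population growth, whereas an excess of intermediate-frequency variants gives a positive D, the direction expected under balancing selection.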
Models of Selective Sweeps
The pattern of variability left by a selective sweep is a rather complicated spatial pattern (Figure 1). By taking information regarding this pattern into account, the power of the neutrality tests can be improved, and it may even be possible to pinpoint the location of a selective sweep. Kim & Stephan (56) developed a method based on an explicit population genetic model of a selective sweep. Using this model, they could calculate the expected frequency spectrum in a site as a function of its distance to an advantageous mutation. By fitting the data to this model, they could estimate the location of the selective sweep and the strength of the selective sweep, and perform hypothesis tests regarding the presence of a sweep. This method is particularly useful in that it takes advantage of the spatial pattern left by the sweep along the sequence.
LD and Haplotype Structure
Levels of linkage disequilibrium (LD), the correlation among alleles from different loci, will increase in selected regions. Regions containing a polymorphism under balancing selection will tend to show reduced LD if the polymorphism is old, but may show increased LD in a transient phase. Selective sweeps also increase levels of LD in a transient phase (Figure 1), although this phase may be relatively short (82). Recently, there has been increased awareness that an incomplete sweep (when the adaptive mutation has not yet been fixed in the population) leaves a distinct pattern in the haplotype structure (87). This has led to the development of many statistical methods for detecting selection based on LD. Hudson et al. (47) developed a test based on the number of alleles occurring in a sample. Andolfatto et al. (4) developed a related test to determine whether any subset of consecutive variable sites contains fewer haplotypes than expected under a neutral model. A similar test was also proposed by Depaulis & Veuille (23). A variation on this theme was proposed by Sabeti et al. (87) who considered the increase in the number of distinct haplotypes away from the location of a putative selective sweep. Kelly (54) considered the level of association between pairs of loci. Kim & Nielsen (55) extended the method of Kim & Stephan (56) to include pairs of sites to incorporate information regarding linkage disequilibrium.
Figure 2
The frequency spectrum under a selective sweep, negative selection, neutrality, and positive selection. The frequency spectra under negative and positive selection are calculated using the PRF model by Sawyer & Hartl (88) for mutations with $2Ns = -5$ and $5$, respectively, where $N$ is the population size and $s$ is the selection coefficient. For the selective sweep, the frequency spectrum is calculated in a window around the location of the adaptive mutation immediately after it has reached fixation in the population. In all cases, a demographic model of a population of constant size with no population subdivision is assumed.
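LD between a pair of sites, described above as the correlation among alleles from different loci, is commonly quantified by the coefficient $D = p_{AB} - p_A p_B$ and its normalized square $r^2 = D^2 / [p_A(1-p_A)\,p_B(1-p_B)]$. The sketch below computes $r^2$ from phased haplotypes encoded as strings of `0` and `1`; the encoding and the toy data are assumptions, and this simple pairwise measure is not one of the specific test statistics cited in this section.

```python
def r_squared(haplotypes, site_a, site_b):
    """Squared allelic correlation between two biallelic sites, computed
    from phased haplotypes given as strings of '0' and '1'."""
    n = len(haplotypes)
    p_a = sum(h[site_a] == "1" for h in haplotypes) / n
    p_b = sum(h[site_b] == "1" for h in haplotypes) / n
    p_ab = sum(h[site_a] == "1" and h[site_b] == "1" for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic in the sample")
    d = p_ab - p_a * p_b  # the LD coefficient D
    return d * d / denom


# Hypothetical samples: complete association versus no association.
linked = ["11", "11", "00", "00"]
unlinked = ["11", "10", "01", "00"]
print(r_squared(linked, 0, 1))
print(r_squared(unlinked, 0, 1))
```

Averaging such pairwise values in a sliding window along a sequence is one simple way to visualize the kind of local increase in LD shown in Figure 1.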
McDonald-Kreitman Tests
Finally, the McDonald-Kreitman test (69) exploits the fact that mutations in coding regions come in two different flavors: nonsynonymous mutations and synonymous mutations. It summarizes the data in what has become known as a McDonald-Kreitman table, which contains counts of the number of nonsynonymous and synonymous mutations within and between species. If selection only affects the nonsynonymous mutations, negative selection will reduce the number of nonsynonymous mutations and positive selection will increase the number of nonsynonymous mutations, relative to the number of synonymous mutations. However, the effect will be stronger in divergence data than in polymorphism data. A test similar to the HKA test can therefore be constructed comparing the ratios of nonsynonymous to synonymous mutations within and between species. If these ratios differ significantly, this provides evidence for selection.
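The comparison of ratios in a McDonald-Kreitman table amounts to a test of independence in a 2 x 2 table. One common choice, assumed here rather than prescribed by the text, is Fisher's exact test. The sketch implements the two-sided version with the standard library only, and the counts are illustrative values chosen for this example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2 x 2 table [[a, b], [c, d]]:
    the total probability of all tables with the same margins that are no
    more probable than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):  # hypergeometric probability of x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# Illustrative McDonald-Kreitman table (counts assumed for this example).
fixed_nonsyn, fixed_syn = 7, 17  # differences fixed between species
poly_nonsyn, poly_syn = 2, 42    # polymorphisms within species

p_value = fisher_exact_two_sided(fixed_nonsyn, fixed_syn, poly_nonsyn, poly_syn)
print(f"nonsyn/syn between species: {fixed_nonsyn / fixed_syn:.3f}")
print(f"nonsyn/syn within species:  {poly_nonsyn / poly_syn:.3f}")
print(f"two-sided Fisher exact p-value: {p_value:.4f}")
```

With these assumed counts the nonsynonymous to synonymous ratio is markedly higher between species than within species, the direction expected if some nonsynonymous differences were fixed by positive selection.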
STATISTICAL CONCERNS
The neutrality tests are all tests of complicated population genetic models that make specific assumptions about the demography of the populations, in particular a constant population size and no population structure. In addition, in some of the tests there may be other implicit assumptions regarding distributions of recombination rates and mutation rates. Many of these tests have long been known to be highly sensitive to the demographic assumptions. For example, Simonsen et al. (96) showed that Tajima’s D test (112) would reject a neutral model very frequently in the presence of population growth. The molecular signature of population growth is in many ways similar to the local effect of a selective sweep, and neutrality tests are often used as a method to detect population growth (85). Nielsen (73), Przeworski (82), and Ingvarsson (50) also argued that simple models of population subdivision can lead the commonly used neutrality tests to reject the neutral model with high probability, even in the absence of selection. In addition, even if the presence of selection can be established, in many cases it can be difficult to distinguish between the pattern left by selective sweeps and selection on slightly deleterious mutations (so-called background selection) (18, 19).
Tests based on patterns of LD may be particularly sensitive to the underlying model assumptions, because they (in addition to assumptions regarding demography) contain strong assumptions regarding the underlying recombination rates. Recent studies suggest that recombination rates are highly variable among regions (70) and among closely related species (83, 117). If that is true, it may not be advisable to focus attention on patterns of LD when attempting to detect selection. Nonetheless, haplotype structure can be highly informative, particularly in detecting incomplete selective sweeps (87). Further research into how haplotype patterns can be used robustly to infer selection may be warranted.
Because of the effect of demographic assumptions on the population genetic neutrality tests, the results of these tests have often been contentious and often have not led to firm conclusions regarding the action of selection. One exception is the McDonald-Kreitman (69) test. This test has increased robustness because the sites in which synonymous and nonsynonymous mutations occur are interspersed among each other and therefore similarly affected by demography and genetic drift. In fact, the McDonald-Kreitman (69) test is robust to any demographic assumption (73). Unfortunately, it may not be very suitable for detecting recent selective sweeps because both nonsynonymous and synonymous mutations, linked to the beneficial mutation, will be similarly affected by the selective sweep. Also, the McDonald-Kreitman (69) test cannot distinguish between past and present selection. Reducing the information in the data simply to the number of nonsynonymous mutations and synonymous mutations leads to a significant loss of information.
One possible way to circumvent the problem of demographic confounding effects is to compare multiple loci. For example, Galtier et al. (35) have implemented a statistical method, applicable to microsatellite loci, to test whether the signature of population growth is constant among loci or varies among loci. If the effect varies significantly among loci, beyond what can be explained by the demographic model, this may be interpreted as evidence for a selective sweep. In general, one can assume that if strong departures from the neutral model are seen only on one or a few outlier loci, this may be interpreted as evidence for selection on these loci. However, certain demographic factors, such as population subdivision, may increase the variance among loci (73). Certain demographic models may be more likely than others to produce outlier loci even in the absence of selection.
The application of population genetic tests other than the McDonald-Kreitman test requires careful consideration of the possible range of demographic factors that may affect the results (2, 73). It is not very meaningful in itself to reject the standard neutral model using these methods without paying careful attention to the underlying demography. Even the interpretation of significant results of the McDonald-Kreitman test requires attention to demography if the directionality (positive versus negative) of selection is to be inferred (26). Fortunately, many recent studies go to great lengths in trying to exclude the possibility that rejections of a neutral model may be caused by demographic effects (3, 116).
SIGNATURES OF SELECTION IN COMPARATIVE DATA
While population genetic approaches aim at detecting ongoing selection in a population, comparative approaches, involving data from multiple different species, are suitable for detecting past selection. The major tool used to detect selection from comparative data is the ratio of the number of nonsynonymous mutations per nonsynonymous site to the number of synonymous mutations per synonymous site ($d_N/d_S$). If there is no selection, not even selection against strongly deleterious mutations, synonymous and nonsynonymous substitutions should occur at the same rate and we would expect $d_N/d_S = 1$. If there is negative selection, $d_N/d_S < 1$, and if there is positive selection, $d_N/d_S > 1$. The $d_N/d_S$ ratio is therefore a proxy for the effect of selection that helps to identify not only selection, but also the directionality of selection. It is therefore a very commonly used tool for detection of positive selection and has been used in a variety of cases, for example, to demonstrate the presence of positive selection on HIV sequences (78) and on the human major histocompatibility locus (MHC) (49). However, as negative selection will tend to dominate in evolution, comparing the average rate of synonymous and nonsynonymous substitution in aligned sequences is a very conservative tool. If the gene is functional so that many or most mutations will disrupt function, the amount of positive selection needed to elevate the $d_N/d_S$ ratio above one is enormous. To overcome this problem, methods have been devised for detecting positive selection that take variation in the $d_N/d_S$ ratio into account (78, 127). The basic idea is to allow the $d_N/d_S$ ratio to follow a statistical distribution among sites. If a distribution that allows values of $d_N/d_S > 1$ fits the data significantly better than a model that does not allow for such values, this is interpreted as evidence for positive selection. The methodology has been widely used and has led to a sharp increase in the number of loci where researchers have detected the presence of positive selection (31, 100, 125). This has also led to some skepticism toward this methodology (105, 106), although it has been found to perform well in simulation studies and is based on well-established statistical principles (5, 120, 124).
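As a rough illustration of how a gene-wide $d_N/d_S$ ratio is obtained and read, the sketch below uses a simple counting approach with a Jukes-Cantor correction for multiple substitutions, $d = -\tfrac{3}{4}\ln(1 - \tfrac{4}{3}p)$, applied separately to nonsynonymous and synonymous sites. This is a simplification introduced here for illustration and is not the likelihood-based methodology discussed above. The numbers of sites and differences are assumed to have been counted beforehand, and all values shown are hypothetical.

```python
import math


def jukes_cantor(p):
    """Jukes-Cantor distance from a proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion of differences must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def dn_ds(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return (dN, dS, dN/dS) from counts of differences and sites."""
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    d_s = jukes_cantor(syn_diffs / syn_sites)
    if d_s == 0.0:
        raise ValueError("dN/dS is undefined when dS is zero")
    return d_n, d_s, d_n / d_s


# Hypothetical gene: 10 differences at 300 nonsynonymous sites and
# 20 differences at 100 synonymous sites.
d_n, d_s, omega = dn_ds(10, 300, 20, 100)
print(f"dN = {d_n:.4f}, dS = {d_s:.4f}, dN/dS = {omega:.3f}")
```

A ratio well below one, as in this made-up case, is what a gene dominated by negative selection would show, which is why an average over the whole gene is a conservative indicator of positive selection.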
Several different statistical methods allow site-specific inferences regarding positive selection (30, 78, 104). The objective of these methods is to determine if specific sites have been targeted by positive (or negative) selection. In several cases, these methods have been used to make functional predictions regarding particular protein residues (91).
$d_N$: number of nonsynonymous mutations per nonsynonymous site
$d_S$: number of synonymous mutations per synonymous site
$d_N/d_S$ ratio: the rate ratio of nonsynonymous to synonymous substitutions
Table 2 A very incomplete list of methods for detecting selection from DNA sequence and SNP data

| Test | Data | Pattern | Requires multiple loci? | Robust to demography? | References |
| --- | --- | --- | --- | --- | --- |
| Tajima’s D and related | Population genetic data | Frequency spectrum | No | No | (28, 32–34, 112) |
| Modeling of selective sweep: spatial pattern | Population genetic data | Frequency spectrum/spatial pattern | No | No | (55, 56) |
| Tests based on LD | Population genetic data | LD and/or haplotype structure | No | No | (4, 23, 47, 54, 87) |
| $F_{ST}$-based and related tests | Population genetic data | Amount of population subdivision | Yes | No (a) | (1, 9, 10, 53, 92, 114) |
| HKA test | Population genetic and comparative data | Number of polymorphisms/substitutions | Yes | No | (48) |
| McDonald-Kreitman-type tests | Population genetic and comparative data | Number of nonsynonymous and synonymous polymorphisms | No | Yes | (16, 69) |
| $d_N/d_S$ ratio tests | Comparative data or population genetic data without recombination (6) | Nonsynonymous and synonymous substitutions | No | Yes | (49, 78, 104, 123, 128, 129) |
The same type of methodology used to model variation in the dN/dS ratio among sites has also been used to model estimates of dN /dS along particular lineages of a phy- logeny (123, 126, 128). This allows the testing of hypotheses regarding selective pressures on particular evolutionary lineages. Models have also been developed that allow site-specific inferences on a particular group of lineages on a phylogeny (128). Several excellent re- cent reviews describe the statistical methods used to detect selection from comparative data in more detail (124, 125). A summary of the different tests of neutrality is given in Table 2.
Targets of Positive Selection
Using analyses of comparative data, a clear picture emerges of the systems that most of-
ten are involved in positive selection of the kind that leads to increases in the dN /dS ratio (75). Typically, it involves an interaction be- tween two organisms, or two different genetic components within the same organism, that compete or interact in such a way that an equilibrium is never reached. The best known examples are host-pathogen interactions that lead to positive selection of genes in pathogens (27, 30, 45, 78, 100) or in host immune and defense systems (49, 75, 90, 100). Other examples include genes involved in gameto- genesis or expressed on the surface of gametes (75, 109, 110, 122). The forces cre- ating positive selection in these genes may in- clude sperm competition (122) and genetic conflicts between sperm and egg-cell (108). Positive selection also seems to be common in cases where selfish genes have the opportunity to create segregation distortion, potentially
Robust to Requires demographic
Test Data Pattern multiple loci factors? References
aThe degree to which these tests are robust to the underlying demographic assumptions is controversial and has not been fully explored.
reducing the fitness of the organism (46, 75). This type of genomic conflict may, for example, occur in loci associated with centromeres (46, 66, 67) or involved in apoptosis during spermatogenesis (75). Tests based on elevated dN/dS ratios tend to detect situations where repeated selective fixations have occurred in the same gene or in the same site, due to a continued dynamic interaction. In contrast, population genetic methods have the ability to detect selection on a single adaptive mutation that recently has swept through the population.
So far, very little research has been done to detect positive selection in noncoding regions based on comparative data. Although methods similar to those used to detect elevated dN/dS ratios can be devised for noncoding regions (119), sites in noncoding regions cannot easily be divided into possibly selected sites and nonselected sites in the way that sites in coding regions are divided into nonsynonymous and synonymous sites. Nonetheless, the presence of highly variable sites in noncoding regions may be a sign of positive selection, and methods to identify such sites may find good use in the analysis of comparative genomic data. A serious practical problem that may arise in the application of such methods is the possibility of confounding misalignments with hypervariable regions.
Most of the literature on statistical methods for detecting selection from comparative data (e.g., from dN/dS ratios) and from population genetic data has been poorly connected. Although the comparative approaches have provided the most unambiguous evidence for positive selection, results have rarely been interpreted in terms of population genetic theory. One probable reason is that multiple population genetic models could generate the same pattern of observed dN/dS ratios, and that any detailed inferences of population genetic processes using comparative data would be based on very strong assumptions regarding the way fitnesses are assigned to mutations (79). Comparative data in themselves are, therefore, unlikely to provide detailed information regarding population genetic processes beyond relatively vague assertions of positive and negative selection and their distribution in the genome. Inferences regarding the type of negative or positive selection operating (e.g., balancing versus positive directional selection) must involve population genetic data. Moreover, comparative approaches cannot alone determine if selection is currently acting in a population. For such inferences population genetic data are also needed.
GENOMIC APPROACHES
The availability of large-scale genomic data has created new challenges and opportunities, especially in allowing for more nonparametric outlier analyses. Genes with increased levels of LD, reduced or enhanced levels of variability, increased levels of population differentiation, or skewed allele frequency spectra may be good candidates for selected loci. Recently, there has been heightened interest particularly in using increased population subdivision as a method for detecting selection (1, 9, 44, 53, 64, 92, 93, 101, 102, 114). For example, Akey et al. (1) used variation in FST (a common measure of population subdivision) in the human genome to identify regions of increased population subdivision.
However, the availability of genomic data does not solve the fundamental problem that population-level demographic processes and selection are confounded. Many demographic processes, such as certain types of population subdivision, may increase the variance in the statistics used to detect selection. Certain demographic models are, therefore, more likely than other models to produce outliers. The outlier approach in population genetics does not solve the problem that a postulated signature of selection, inferred from population genetic data, may instead be the product of complicated demographics. Nonetheless, certain approaches based on detecting extreme levels of population subdivision seem to have some robustness to the model assumptions (9, 114).
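The outlier idea can be sketched as follows. The example assumes biallelic loci with known allele frequencies in two equally weighted populations, computes FST per locus as (HT - HS)/HT from expected heterozygosities, and flags the loci in the upper tail of the empirical distribution. This is a simplified illustration and not the estimator or threshold used in the cited genome scans; the locus names and frequencies are hypothetical. As stressed above, a locus flagged in this way is only a candidate, because demography alone can produce extreme values.

```python
def fst_two_populations(p1, p2):
    """F_ST for one biallelic locus from allele frequencies in two populations."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)            # expected heterozygosity, pooled
    if h_total == 0.0:
        return 0.0                                   # monomorphic locus
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total


def empirical_outliers(fst_by_locus, top_fraction=0.05):
    """Return the names of the loci in the top fraction of the F_ST distribution."""
    ranked = sorted(fst_by_locus.items(), key=lambda item: item[1], reverse=True)
    n_keep = max(1, int(len(ranked) * top_fraction))
    return [name for name, _ in ranked[:n_keep]]


# Hypothetical allele frequencies (population 1, population 2) per locus.
frequencies = {"snp_a": (0.50, 0.55), "snp_b": (0.10, 0.12),
               "snp_c": (0.05, 0.90), "snp_d": (0.30, 0.40)}
fst = {name: fst_two_populations(p1, p2) for name, (p1, p2) in frequencies.items()}
print(empirical_outliers(fst, top_fraction=0.25))
```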
PRF: Poisson random field
PRF Models
The simultaneous analysis of multiple genomic loci allows the estimation of parameters that are common among loci, potentially leading to increased power and robustness. For example, Bustamante et al. (16) analyzed McDonald-Kreitman tables from Arabidopsis and Drosophila in a statistical framework that allows the divergence time between species to be a shared parameter among all loci, leading to increased statistical power. Similar approaches can be used to increase the robustness of the statistical methods by explicitly estimating demographic parameters, thereby taking the uncertainty introduced by the unknown demographic processes into account. This is particularly convenient in the framework of Poisson random field (PRF) models introduced by Sawyer & Hartl (88). These models assume that all loci (individual SNP sites) are independent, i.e., effectively unlinked. This implies that they may provide a good approximation in the analysis of SNP data from multiple locations throughout the genome, but less so in the analysis of DNA sequence data from a single or a few loci. In these models, the expected frequency spectrum (or the entries of a McDonald-Kreitman table) can be calculated directly using mathematical models. This means that selection coefficients for particular classes of mutations can be estimated directly, and various hypotheses regarding selection can be tested in a rigorous statistical framework (15–17, 89). For example, it is possible to estimate which types of amino acid-changing mutations have the largest effect on fitness (15, 118). Such methods may eventually be very useful when designing statistical methods for predicting which mutations are most likely to cause disease. However, inferences based on PRF models differ fundamentally from most other methods for identifying selection, because the effect of selection on linked neutral sites is not incorporated into the models. Whereas most methods for detecting positive selection in terms of selective sweeps consider the effect of a positively selected mutation on the nearby neutral variation, PRF models provide predictions regarding the selected mutation itself. In most applications, estimates based on PRF models will, therefore, be biased (17). Nonetheless, the PRF models provide a convenient, computationally tractable statistical framework for examining the effect of selection on different classes of mutations.
Williamson et al. (118) used PRF models to estimate the average selection coefficient acting on different classes of mutations in the human genome. The novelty of their approach (118) was that a demographic model was fitted to the data from synonymous mutations, while selection coefficients were estimated for the same demographic model applied to nonsynonymous mutations. The resulting test was shown to be robust to many different assumptions regarding demographic processes. By explicitly incorporating demography into the model, a high degree of robustness was achieved. Unfortunately, there are no similar approaches for detecting selection from individual loci containing multiple linked mutations. The current methods for taking demographic processes into account when analyzing data from loci with linked mutations involve extensive simulations of data under various demographic models (3, 75, 116).
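To make concrete what it means that the expected frequency spectrum can be calculated directly, the sketch below numerically integrates a PRF-type expression for the expected number of sites at which the derived allele occurs i times in a sample of n chromosomes. It assumes a constant population size and a single scaled selection coefficient gamma for all mutations. The scaling conventions for theta and gamma differ among papers; here theta is scaled so that the neutral case (gamma = 0) gives the familiar expectation theta/i. The function name and parameter values are hypothetical, and demographic effects such as those fitted by Williamson et al. are not included.

```python
import math


def expected_sfs(n, theta, gamma, steps=20000):
    """Expected counts of sites with i = 1..n-1 derived copies in a sample of n.

    Integrates C(n, i) x^i (1-x)^(n-i) against a stationary density
    proportional to (1 - exp(-2*gamma*(1-x))) / ((1 - exp(-2*gamma)) x (1-x)),
    using the midpoint rule. The x and (1-x) in the denominator are cancelled
    analytically against the binomial term to keep the integrand bounded.
    """
    spectrum = []
    for i in range(1, n):
        total = 0.0
        for k in range(steps):
            x = (k + 0.5) / steps
            if abs(gamma) < 1e-8:
                weight = 1.0 - x                      # neutral limit
            else:
                weight = (math.expm1(-2.0 * gamma * (1.0 - x))
                          / math.expm1(-2.0 * gamma))
            total += x ** (i - 1) * (1.0 - x) ** (n - i - 1) * weight
        spectrum.append(theta * math.comb(n, i) * total / steps)
    return spectrum


neutral = expected_sfs(n=10, theta=1.0, gamma=0.0)
positive = expected_sfs(n=10, theta=1.0, gamma=5.0)
negative = expected_sfs(n=10, theta=1.0, gamma=-5.0)
for i, (a, b, c) in enumerate(zip(negative, neutral, positive), start=1):
    print(i, round(a, 4), round(b, 4), round(c, 4))
```

Under these assumptions, negative selection shifts the spectrum toward rare derived alleles and positive selection toward common ones, which is the kind of contrast that allows selection coefficients to be estimated from frequency data.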
SNP Data
With the availability of large-scale SNP data sets, it should, in principle, be possible to provide detailed selection maps in humans and other organisms. Standard methods for detecting selection from population genetics can, in principle, be applied to provide a detailed picture of the regions of the genome that may have been targeted by selection. However, most SNP data have been obtained through a complicated SNP discovery process that minimally involves the discovery (or ascertainment) of SNPs in a small sample followed by genotyping in a larger sample. The process by which the SNPs have been selected affects levels of LD observed in the data (77), the frequency spectrum (77), and levels of population subdivision (74, 115). It also affects the variance in these statistics, complicating genomic methods based on outlier detection. The solution to this problem is to explicitly take the ascertainment process into account. Most statistical methods can be corrected relatively easily (76, 77), leading to new valid methods for detecting selection that take the SNP ascertainment process into account. Unfortunately, most current SNP databases and large-scale SNP genotyping efforts (37) are not associated with sufficiently detailed information regarding the ascertainment process to allow appropriate ascertainment bias corrections. At present, it is difficult or impossible to make valid inferences regarding selection from most large-scale SNP data sources. It is to be hoped that this will change in the future as researchers become more aware of the importance of maintaining detailed records regarding SNP ascertainment processes.
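A simple version of such a correction can be sketched under a deliberately idealized ascertainment scheme. The example assumes that SNPs were discovered in a random panel of d chromosomes drawn from the n chromosomes that were later genotyped, and that a site was retained only if it was variable in the panel. Under that assumption the probability of ascertaining a SNP with i derived copies is 1 - [C(i, d) + C(n - i, d)] / C(n, d), and the observed spectrum can be reweighted by its inverse. Real discovery schemes are usually more complicated, which is exactly why detailed ascertainment records are needed; the function names are hypothetical.

```python
import math


def ascertainment_prob(i, n, d):
    """P(SNP with i derived copies among n chromosomes is variable in a panel of d)."""
    if not 2 <= d <= n:
        raise ValueError("panel size d must satisfy 2 <= d <= n")
    # The SNP is missed if the panel contains only derived or only ancestral copies.
    return 1.0 - (math.comb(i, d) + math.comb(n - i, d)) / math.comb(n, d)


def correct_spectrum(observed, n, d):
    """Reweight observed proportions (classes i = 1..n-1) by 1 / P(ascertained)."""
    weights = [observed[i - 1] / ascertainment_prob(i, n, d) for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


# Rare alleles are much less likely to be discovered in a small panel.
for i in (1, 5, 9):
    print(i, round(ascertainment_prob(i, n=10, d=2), 3))
```

Because rare alleles are under-represented after discovery in a small panel, an uncorrected spectrum can mimic the excess of intermediate-frequency alleles that some neutrality tests interpret as a signal of selection or demography.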
Comparative Genomic Data
As more and more genomes are sequenced, comparative approaches for detecting positive selection at a genome-wide scale are becoming increasingly common (22, 75). The standard methods for detecting positive (or negative) selection using dN/dS ratios can be applied directly in studies on a genomic scale. However, current methods can be improved by establishing models that take advantage of the fact that (ignoring within-species variability) all genes in a phylogeny share the same evolutionary tree.
FUNCTIONAL INFERENCES
In the field of bioinformatics there has been a long tradition of using conserved sites in comparative data to infer function. The implicit assumption is that high levels of conservation are caused by negative selection against new deleterious mutations, i.e., functional constraints. In the absence of site-specific suppression of the biological mutation rate, highly reduced levels of variability must be caused by negative selection.
Phylogenetic Footprinting
Although there exist many methods for quantifying how conserved a site, or a set of sites, is, the most statistically solid methods for identifying conserved sites are known as phylogenetic footprinting. In these methods, the rate of substitution in a particular site (or collection of sites) is estimated by considering the pattern of mutation along the underlying phylogeny. This is typically done by mapping mutations onto the phylogenetic tree using parsimony (12) and is complicated by the fact that the alignment may be ambiguous in noncoding regions for divergent species. These methods have been used for a variety of purposes and have been particularly successful in identifying regulatory elements in noncoding DNA (24, 111). The advantage of these methods is that they explicitly take the underlying evolutionary correlations (the phylogeny) into account, leading to increased statistical power and accuracy over methods that do not consider the phylogeny.
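The parsimony step can be illustrated with a minimal sketch based on Fitch's algorithm, which gives the minimum number of substitutions needed to explain one alignment column on a fixed rooted binary tree. Columns that require few or no changes are candidates for conserved sites. This is a simplified stand-in for the footprinting methods cited above, which work with motifs and handle alignment uncertainty; the tree, species names, and sequences are hypothetical.

```python
def fitch_score(tree, column):
    """Minimum number of substitutions for one column on a rooted binary tree.

    `tree` is a nested tuple of species names, e.g. (("a", "b"), ("c", "d")),
    and `column` maps each species name to its nucleotide at this position.
    """
    def visit(node):
        if isinstance(node, str):                       # leaf
            return {column[node]}, 0
        left, right = node
        left_states, left_cost = visit(left)
        right_states, right_cost = visit(right)
        shared = left_states & right_states
        if shared:                                      # no change needed here
            return shared, left_cost + right_cost
        return left_states | right_states, left_cost + right_cost + 1

    return visit(tree)[1]


def conserved_columns(tree, alignment, max_changes=0):
    """Return positions whose parsimony score is at most `max_changes`."""
    length = len(next(iter(alignment.values())))
    hits = []
    for pos in range(length):
        column = {species: seq[pos] for species, seq in alignment.items()}
        if fitch_score(tree, column) <= max_changes:
            hits.append(pos)
    return hits


tree = (("human", "chimp"), ("mouse", "rat"))
alignment = {"human": "ACGT", "chimp": "ACGA", "mouse": "ACTT", "rat": "ATTA"}
print(conserved_columns(tree, alignment))
```

Scoring on the tree, rather than simply counting distinct nucleotides in a column, is what lets such methods account for the shared ancestry of the sequences.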
One of the most exciting recent discoveries in the field of genomics is the presence of extremely conserved regions, with no known function, in mammalian genomes (11). Such regions may be regulatory regions, or they may contain conserved structural features or unannotated protein-coding genes or RNA genes. To determine if these regions are truly under selection, neutrality tests comparing intraspecific and interspecific variability could be used. There is even the possibility of positive selection in noncoding regions. More research is needed to develop appropriate statistical methods for identifying selection outside coding regions from genomic-scale comparative and population genetic data.
Disease Genetics
In disease genetics, there is an increased awareness that regions of the human genome that have been targeted by positive selection may be disease associated (7). Disease-causing mutations should affect organismal fitness, except if the age of onset of the disease is very late. There is, therefore, an intimate relationship between disease and selection that potentially can be exploited in identifying candidate disease loci and candidate SNPs.
A very promising application is in the identification of putative disease-causing SNPs. Evolutionary inferences from comparative and population genetic data, in combination with functional and structural information, can be used to predict which mutations most likely have negative fitness consequences. The mutations with the most severe fitness consequences are obviously the mutations that are most likely to be disease causing. Several different methods have already been described that allow prediction of potentially disease-causing mutations (72, 84). These methods may potentially be improved by using explicit population genetic models. This seems to be a particularly promising application of PRF models as these models can describe explicitly the selection coefficients acting on particular classes of mutations (15).
Positive Selection
While there has long been a focus on the use of conservation (negative selection) to find functional elements, increased attention has recently been directed toward the possibility of using inferences regarding positive selection to elucidate functional relationships. In human genetics, several cases are known where recessive disease-causing mutations are thought to have been carried to high frequencies in the populations because they confer a fitness advantage in the heterozygous condition. Diseases that have been hypothesized to have been targeted by this type of overdominant selection include sickle-cell anemia (42), glucose-6-phosphate dehydrogenase deficiency (86), Tay-Sachs disease (99), cystic fibrosis (94), and phenylketonuria (121). It is not known how many of the common disease factors have been influenced by overdominant selection, but these observations do suggest that regions of the human genome that have been targeted by balancing selection may contain disease-causing variants worth exploring.
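The way overdominant selection can hold a harmful recessive allele at appreciable frequency can be shown with the standard one-locus, two-allele model. The sketch assumes random mating, an infinite population, and relative fitnesses of 1 - s for the common homozygote, 1 for the heterozygote, and 1 - t for the disease homozygote, in which case the disease allele is expected to settle at frequency s/(s + t). The values of s and t are hypothetical and are not estimates for any of the diseases listed above.

```python
def next_frequency(q, s, t):
    """One generation of selection on the disease allele (frequency q)."""
    p = 1.0 - q
    mean_fitness = p * p * (1.0 - s) + 2.0 * p * q + q * q * (1.0 - t)
    # Marginal fitness of the disease allele: heterozygotes (1) and homozygotes (1 - t).
    return q * (p + q * (1.0 - t)) / mean_fitness


def equilibrium_by_iteration(q0, s, t, generations=5000):
    """Iterate the recursion from starting frequency q0."""
    q = q0
    for _ in range(generations):
        q = next_frequency(q, s, t)
    return q


s, t = 0.1, 0.8        # hypothetical selection coefficients
print(equilibrium_by_iteration(0.01, s, t), s / (s + t))
```

Even with a large fitness cost t in homozygotes, a modest heterozygote advantage s is enough in this model to maintain the allele at a frequency of roughly ten percent.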
In virology, site-specific inferences regarding positive selection have been used in several cases to identify functionally important sites. In HIV, site-specific inferences of dN/dS ratios have been used to identify positions that may be involved in drug resistance (21). In HIV and other viruses, sites that may interact with the host immune system have been identified by detecting site-specific selective pressures, and it has been proposed that such methods may assist in the development of vaccines (36, 95). It has also been proposed that site-specific inferences of dN/dS ratios may help predict the evolution of virulent strains of influenza (14). Recently, site-specific inferences of dN/dS ratios from different primate species were used to identify a new species-specific retroviral restriction domain (91).
EVIDENCE FOR SELECTION
There is an increasing amount of evidence that selection is important in shaping variation within and between species. In human SNP data, there is a clear difference in the frequency spectrum between nonsynonymous and synonymous mutations (103, 118). This observation in itself shows that a large proportion of the mutations that are segregating in humans (and presumably in other species as well) are affected by selection. In addition, there is a rapidly growing list of specific genes that show evidence for positive selection in both humans and other organisms (7, 31, 98, 113, 125). This explosion of results showing a presence of positive selection may in fact suggest that positive selection is much more common than previously believed. Positively selected mutations may just have remained hidden among all the negatively selected mutations. In addition, ambiguity in the interpretation of classical population genetic neutrality tests, due to the presence of confounding demographic factors, may have precluded the establishment of firm conclusions regarding the pervasiveness of selection. As more large-scale data have accumulated, and methods that are robust to demographic assumptions have been applied, a clearer picture of the pervasiveness of positive selection has been established. Modern versions of the neutral theory (80, 81) allow for a substantial amount of negative selection, and even some positive selection. As the evidence for selection accumulates, the debate regarding the causes of molecular evolution should focus on whether selection is so dominating that effective population sizes and standing levels of variation are best described by the models of repeated selective sweeps favored by Gillespie (40, 41), or whether classical models of genetic drift are most appropriate. In the models that Gillespie has proposed, known as genetic draft models, mutations causing species differences are not neutral mutations increasing in frequency due to genetic drift, but primarily neutral mutations increasing in frequency due to linkage with adaptive mutations sweeping through the population. Even though only few mutations are adaptive, the population genetic dynamics are determined by the selective forces acting on the adaptive mutations, not by genetic drift. There is no mathematical or empirical evidence to suggest that this model is unrealistic, and as the evidence in favor of positive selection accumulates, the question arises whether models of draft should replace models of drift.
With the new availability of very large population genetic and comparative genomic data sets, we should soon be able to determine how many genes, and how large a proportion of mutations, have been affected by positive and negative selection. This will also lead to more evolutionary explorations into the molecular nature of adaptation, help predict which SNPs in humans may be disease associated, and lead to improved functional annotations of genomic data. Methods that combine comparative and population genetic data, and methods that have a high degree of robustness to the underlying demographic factors, may be particularly useful in this endeavor.
SUMMARY POINTS
1. Both positive and negative selection leave distinctive signatures at the molecular level that can be detected using statistical tests.
2. In population genetic data, selection may affect levels of variability, linkage disequilibrium, haplotype structure, and the allelic distribution in each nucleotide site (frequency spectrum). In comparative data, selection has a strong effect on the dN/dS ratio.
3. Statistical methods for detecting selection differ in the assumptions they make and how powerful they are. Most methods applicable to population genetic data rely on strong assumptions regarding the demography of the populations, while comparative methods are free of such assumptions.
4. An increasing amount of evidence suggests that positive selection is much more pervasive than previously thought.
5. Inferences regarding selection provide a powerful tool in functional studies, for example for the prediction of possible disease-related genomic regions.
UNRESOLVED ISSUES
1. Can robust statistical population genetic tests be developed that can help identify genomic regions targeted by positive selection?
2. Will inferences regarding selection help identify disease loci in humans and other organisms?
3. Should we focus on genetic draft instead of genetic drift?
A related review focusing on the problem of distinguishing background selection from selective sweeps, with particular focus on Drosophila populations.
LITERATURE CITED
1. Akey JM, Zhang G, Zhang K, Jin L, Shriver MD. 2002. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 12:1805–14
2. Andolfatto P. 2001. Adaptive hitchhiking effects on genome variability. Curr. Opin. Genet. Dev. 11:635–41
3. Andolfatto P, Przeworski M. 2000. A genome-wide departure from the standard neutral model in natural populations of Drosophila. Genetics 156:257–68
4. Andolfatto P, Wall JD, Kreitman M. 1999. Unusual haplotype structure at the proximal breakpoint of In(2L)t in a natural population of Drosophila melanogaster. Genetics 153:1297–311
5. Anisimova M, Bielawski JP, Yang ZH. 2001. Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol. Biol. Evol. 18:1585–92
6. Anisimova M, Nielsen R, Yang ZH. 2003. Effect of recombination on the accuracy of the likelihood method for detecting positive selection at amino acid sites. Genetics 164:1229–36
7. Bamshad M, Wooding SP. 2003. Signature of natural selection in the human genome. Nat. Rev. Genet. 4:99
8. Barton NH. 1995. Linkage and the limits to natural selection. Genetics 140:821–41
9. Beaumont MA, Balding DJ. 2004. Identifying adaptive genetic divergence among populations from genome scans. Mol. Ecol. 13:969–80
10. Beaumont MA, Nichols RA. 1996. Evaluating loci for use in the genetic analysis of population structure. Proc. R. Soc. London Ser. B 263:1619–26
11. Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, et al. 2004. Ultraconserved elements in the human genome. Science 304:1321–25
12. Blanchette M, Tompa M. 2002. Discovery of regulatory elements by a computational method for phylogenetic footprinting. Genome Res. 12:739–48
13. Braverman JM, Hudson RR, Kaplan NL, Langley CH, Stephan W. 1995. The hitchhiking effect on the site frequency-spectrum of DNA polymorphisms. Genetics 140:783–96
14. Bush RM, Bender CA, Subbarao K, Cox NJ, Fitch WM. 1999. Predicting the evolution of human influenza A. Science 286:1921–25
15. Bustamante CD, Nielsen R, Hartl DL. 2003. Maximum likelihood and Bayesian methods for estimating the distribution of selective effects among classes of mutations using DNA polymorphism data. Theor. Popul. Biol. 63:91–103
16. Bustamante CD, Nielsen R, Sawyer SA, Olsen KM, Purugganan MD, Hartl DL. 2002. The cost of inbreeding in Arabidopsis. Nature 416:531–34
17. Bustamante CD, Wakeley J, Sawyer S, Hartl DL. 2001. Directional selection and the site-frequency spectrum. Genetics 159:1779–88
18. Charlesworth B. 1994. The effect of background selection against deleterious mutations on weakly selected, linked variants. Genet. Res. 63:213–27
19. Charlesworth B, Morgan MT, Charlesworth D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134:1289–303
20. Charlesworth B, Nordborg M, Charlesworth D. 1997. The effects of local selection, balanced polymorphism and background selection on equilibrium patterns of genetic diversity in subdivided populations. Genet. Res. 70:155–74
21. Chen L, Perlina A, Lee CJ. 2004. Positive selection detection in 40,000 human immunodeficiency virus (HIV) type 1 sequences automatically identifies drug resistance and positive fitness mutations in HIV protease and reverse transcriptase. J. Virol. 78:3722–32
22. Clark AG, Glanowski S, Nielsen R, Thomas PD, Kejariwal A, et al. 2003. Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302:1960–63
23. Depaulis F, Veuille M. 1998. Neutrality tests based on the distribution of haplotypes under an infinite-site model. Mol. Biol. Evol. 15:1788–90
24. Duret L, Bucher P. 1997. Searching for regulatory elements in human noncoding sequences. Curr. Opin. Struct. Biol. 7:399–406
25. Ewens WJ. 2004. Mathematical Population Genetics. I. Theoretical Introduction. Berlin/Heidelberg/New York: Springer
26. Eyre-Walker A. 2002. Changing effective population size and the McDonald-Kreitman test. Genetics 162:2017–24
27. Fares MA, Moya A, Escarmis C, Baranowski E, Domingo E, Barrio E. 2001. Evidence for positive selection in the capsid protein-coding region of the foot-and-mouth disease virus (FMDV) subjected to experimental passage regimens. Mol. Biol. Evol. 18:10
28. Fay JC, Wu CI. 2000. Hitchhiking under positive Darwinian selection. Genetics 155:1405–13
29. Felsenstein J. 1974. Evolutionary advantage of recombination. Genetics 78:737–56
30. Fitch WM, Bush RM, Bender CA, Cox NJ. 1997. Long term trends in the evolution of H(3) HA1 human influenza type A. Proc. Natl. Acad. Sci. USA 94:7712–18
31. Ford MJ. 2002. Applications of selective neutrality tests to molecular ecology. Mol. Ecol. 11:1245–62
32. Fu YX. 1996. New statistical tests of neutrality for DNA samples from a population. Genetics 143:557–70
33. Fu YX. 1997. Statistical tests of neutrality of mutations against population growth, hitchhiking and background selection. Genetics 147:915–25
34. Fu YX, Li WH. 1993. Statistical tests of neutrality of mutations. Genetics 133:693–709
35. Galtier N, Depaulis F, Barton NH. 2000. Detecting bottlenecks and selective sweeps from DNA sequence polymorphism. Genetics 155:981–87
36. Gaschen B, Taylor J, Yusim K, Foley B, Gao F, et al. 2002. Diversity considerations in HIV-1 vaccine selection. Science 299:1515–18
37. Gibbs RA, Belmont JW, Hardenbol P, Willis TD, Yu FL, et al. 2003. The International HapMap Project. Nature 426:789–96
38. Gillespie JH, Langley CH. 1974. General model to account for enzyme variation in natural populations. Genetics 76:837–84
39. Gillespie JH. 1991. The Causes of Molecular Evolution. New York: Oxford Univ. Press. 336 pp.
40. Gillespie JH. 2000. Genetic drift in an infinite population: the pseudohitchhiking model. Genetics 155:909–19
A great introduction to population genetic theory for the mathematically literate.
41. Gillespie JH. 2001. Is the population size of a species relevant to its evolution? Evolution 55:2161–69
42. Haldane JBS. 1949. Disease and evolution. Ricerca Sci. 19:3–10
43. Haldane JBS. 1957. The cost of natural selection. Genetics 55:511–24
44. Harr B, Kauer M, Schlotterer C. 2002. Hitchhiking mapping: a population-based fine-
mapping strategy for adaptive mutations in Drosophila melanogaster. Proc. Natl. Acad. Sci.
USA 99:12949–54
45. Haydon DT, Bastos AD, Knowles NJ, Samuel AR. 2001. Evidence for positive selection
in foot-and-mouth disease virus capsid genes from field isolates. Genetics 157:7
46. Henikoff S, Malik HS. 2002. Selfish drivers. Nature 417:227
47. Hudson RR, Bailey K, Skarecky D, Kwiatowski J, Ayala FJ. 1994. Evidence for positive
selection in the superoxide-dismutase (Sod) region of Drosophila-melanogaster. Genetics
136:1329–40
48. Hudson RR, Kreitman M, Aguade M. 1987. A test of neutral molecular evolution based
on nucleotide data. Genetics 116:153–59
49. Hughes AL, Nei M. 1988. Pattern of nucleotide substitution at major histocompatibility
complex class-I loci reveals overdominant selection. Nature 335:167–70
50. Ingvarsson PK. 2004. Population subdivision and the Hudson-Kreitman-Aguade test: testing for deviations from the neutral model in organelle genomes. Genet. Res. 83:31–
39
51. Kaplan NL, Darden T, Hudson RR. 1988. The coalescent process in models with selec-
tion. Genetics 120:819–29
52. Kaplan NL, Hudson RR, Langley CH. 1989. The hitchhiking effect revisited. Genetics
123:887–99
53. Kayser M, Brauer S, Stoneking M. 2003. A genome scan to detect candidate regions
influenced by local natural selection in human populations. Mol. Biol. Evol. 20:893–
900
54. Kelly JK. 1997. A test of neutrality based on interlocus associations. Genetics 146:1197–
206
55. KimY,NielsenR.2004.Linkagedisequilibriumasasignatureofselectivesweeps.Genetics
167:1513–24
56. Kim Y, Stephan W. 2002. Detecting a local signature of genetic hitchhiking along a
recombining chromosome. Genetics 160:765–77
57. KimY,StephanW.2003.Selectivesweepsinthepresenceofinterferenceamongpartially
linked loci. Genetics 164:389–98
58. Kimura M. 1968. Evolutionary rate at the molecular level. Nature 217:624
59. Kimura M. 1983. The Neutral Theory of Molecular Evolution. New York: Cambridge Univ.
Press. 367 pp.
60. Kimura M. 1995. Limitations of Darwinian selection in a finite population. Proc. Natl.
Acad. Sci. USA 92:2343–44
61. Kimura M, Crow J. 1964. The number of alleles that can be maintained in a finite population. Genetics 49:725–38
62. Kondrashov AS. 1982. Selection against harmful mutations in large sexual and asexual
populations. Genet. Res. 40:325–32
63. Lewontin RC, Krakauer J. 1973. Distribution of gene frequency as a test of theory of
selective neutrality of polymorphisms. Genetics 74:175–95
64. Luikart G, England PR, Tallmon D, Jordan S, Taberlet P. 2003. The power and promise
of population genomics: from genotyping to genome typing. Nat. Rev. Genet. 4:981–94
65. Majewski J, Cohan FM. 1999. Adapt globally, act locally: the effect of selective sweeps on bacterial sequence diversity. Genetics 152:1459–74
66. Malik HS, Henikoff S. 2001. Adaptive evolution of cid, a centromere-specific histone in Drosophila. Genetics 157:1293–98
67. Malik HS, Henikoff S. 2002. Conflict begets complexity: the evolution of centromeres. Curr. Opin. Genet. Dev. 12:711–18
68. Maynard Smith J, Haigh J. 1974. The hitch-hiking effect of a favourable gene. Genet. Res. 23:23–35
69. McDonald JH, Kreitman M. 1991. Adaptive protein evolution at the Adh locus in Drosophila. Nature 351:652–54
70. McVean GAT, Myers SR, Hunt S, Deloukas P, Bentley DR, Donnelly P. 2004. The fine-scale structure of recombination rate variation in the human genome. Science 304:581–84
71. Milkman RD. 1967. Heterosis as a major cause of heterozygosity in nature. Genetics 55:493–95
72. Ng PC, Henikoff S. 2001. Predicting deleterious amino acid substitutions. Genome Res. 11:863–74
73. Nielsen R. 2001. Statistical tests of selective neutrality in the age of genomics. Heredity 86:641–47
74. Nielsen R. 2004. Population genetic analysis of ascertained SNP data. Human Genomics 3:218–24
75. Nielsen R, Bustamante CD, Clark AG, Glanowski S, Sackton TB, et al. 2005. A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol. In press
76. Nielsen R, Hubisz MJ, Clark AG. 2004. Reconstituting the frequency spectrum of ascertained single-nucleotide polymorphism data. Genetics 168:2373–82
77. Nielsen R, Signorovitch J. 2003. Correcting for ascertainment biases when analyzing SNP data: applications to the estimation of linkage disequilibrium. Theor. Popul. Biol. 63:245–55
78. Nielsen R, Yang ZH. 1998. Likelihood models for detecting positively selected amino
acid sites and applications to the HIV-1 envelope gene. Genetics 148:929–36
79. Nielsen R, Yang ZH. 2003. Estimating the distribution of selection coefficients from phylogenetic data with applications to mitochondrial and viral DNA. Mol. Biol. Evol. 20:1231–39
80. Ohta T. 1992. The nearly neutral theory of molecular evolution. Annu. Rev. Ecol. Syst.
23:263–86
81. Ohta T. 2002. Near-neutrality in evolution of genes and gene regulation. Proc. Natl. Acad. Sci. USA 99:16134–37
82. Przeworski M. 2002. The signature of positive selection at randomly chosen loci. Genetics 160:1179–89
83. Ptak SE, Roeder AD, Stephens M, Gilad Y, Pääbo S, Przeworski M. 2004. Absence of the TAP2 human recombination hotspot in chimpanzees. PLoS Biol. 2:849–55
84. Ramensky V, Bork P, Sunyaev S. 2002. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 30:3894–900
85. Ramos-Onsins SE, Rozas J. 2002. Statistical properties of new neutrality tests against
population growth. Mol. Biol. Evol. 19:2092–100
86. Ruwende C, Khoo SC, Snow AW, Yates SNR, Kwiatkowski D, et al. 1995. Natural-selection of hemizygotes and heterozygotes for G6pd deficiency in Africa by resistance to severe malaria. Nature 376:246–49
The classic paper introducing the idea of a selective sweep.
The first paper introducing PRF models as a statistical framework for population genetic inferences.
An elegant paper showing how inferences regarding selection can be used to make functional predictions that can be tested in the lab.
87. Sabeti PC, Reich DE, Higgins JM, Levine HZP, Richter DJ, et al. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419:832–37
88. Sawyer SA, Hartl DL. 1992. Population genetics of polymorphism and divergence. Genetics 132:1161–76
89. Sawyer SA, Kulathinal RJ, Bustamante CD, Hartl DL. 2003. Bayesian analysis suggests that most amino acid replacements in Drosophila are driven by positive selection. J. Mol. Evol. 57(Suppl. 1):S154–64
90. Sawyer SL, Emerman M, Malik HS. 2004. Ancient adaptive evolution of the primate antiviral DNA-editing enzyme APOBEC3G. PLoS Biol. 2:1278–85
91. Sawyer SL, Wu LI, Emerman M, Malik HS. 2005. Positive selection of primate TRIM5 alpha identifies a critical species-specific retroviral restriction domain. Proc. Natl. Acad. Sci. USA 102:2832–37
92. Schlotterer C. 2002. A microsatellite-based multilocus screen for the identification of local selective sweeps. Genetics 160:753–63
93. Schlotterer C. 2003. Hitchhiking mapping—functional genomics from the population genetics perspective. Trends Genet. 19:32–38
94. Schroeder SA, Gaughan DM, Swift M. 1995. Protection against bronchial-asthma by Cftr Delta-F508 mutation—a heterozygote advantage in cystic-fibrosis. Nat. Med. 1:703–5
95. Sheridan I, Pybus OG, Holmes EC, Klenerman P. 2004. High-resolution phylogenetic analysis of Hepatitis C virus adaptation and its relationship to disease progression. J. Virol. 78:3447–54
96. Simonsen KL, Churchill GA, Aquadro CF. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141:413–29
97. Slatkin M, Wiehe T. 1998. Genetic hitch-hiking in a subdivided population. Genet. Res. 71:155–60
98. Smith NGC, Eyre-Walker A. 2002. Adaptive protein evolution in Drosophila. Nature 415:1022–24
99. Spyropoulos B, Moens PB, Davidson J, Lowden JA. 1981. Heterozygote advantage in Tay-Sachs carriers. Am. J. Hum. Genet. 33:375–80
100. Stahl EA, Bishop JG. 2000. Plant-pathogen arms races at the molecular level. Curr. Opin. Plant Biol. 3:299
101. Storz JF. 2005. Using genome scans of DNA polymorphism to infer adaptive population divergence. Mol. Ecol. 14:671–88
102. Storz JF, Payseur BA, Nachman MW. 2004. Genome scans of DNA variability in humans reveal evidence for selective sweeps outside of Africa. Mol. Biol. Evol. 21:1800–11
103. Sunyaev SR, Lathe WC III, Ramensky VE, Bork P. 2000. SNP frequencies in human genes: an excess of rare alleles and differing modes of selection. Trends Genet. 16:335–37
104. Suzuki Y, Gojobori T. 1999. A method for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 16:1315–28
105. Suzuki Y, Nei M. 2002. Simulation study of the reliability and robustness of the statistical methods for detecting positive selection at single amino acid sites. Mol. Biol. Evol. 19:1865–69
106. Suzuki Y, Nei M. 2004. False-positive selection identified by ML-based methods: examples from the Sig1 gene of the diatom Thalassiosira weissflogii and the tax gene of a human T-cell lymphotropic virus. Mol. Biol. Evol. 21:914–21
107. Sved JA, Reed TE, Bodmer WF. 1967. The number of balanced polymorphisms that can be maintained in a natural population. Genetics 55:469–81
A related review of methods for detecting selection in genomic scans, with particular focus on statistics based on population differentiation.
108. Swanson WJ, Aquadro CF, Vacquier VD. 2001. Polymorphism in abalone fertilization proteins is consistent with the neutral evolution of the egg’s receptor for lysin (VERL) and positive Darwinian selection of sperm lysin. Mol. Biol. Evol. 18:376–83
109. Swanson WJ, Nielsen R, Yang QF. 2003. Pervasive adaptive evolution in mammalian fertilization proteins. Mol. Biol. Evol. 20:18–20
110. Swanson WJ, Zhang ZH, Wolfner MF, Aquadro CF. 2001. Positive Darwinian selection drives the evolution of several female reproductive proteins in mammals. Proc. Natl. Acad. Sci. USA 98:2509–14
111. Tagle D, Koop B, Goodman M, Slightom J, Hess D, Jones R. 1988. Embryonic ε and γ globin genes of a prosimian primate (Galago crassicaudatus); nucleotide and amino acid sequences, developmental regulation and phylogenetic footprints. J. Mol. Biol. 203:439–55
112. Tajima F. 1989. Statistical method for testing the neutral mutation hypothesis by DNA polymorphism. Genetics 123:585–95
113. Vallender EJ, Lahn BT. 2004. Positive selection on the human genome. Hum. Mol. Genet. 13:R245–54
114. Vitalis R, Dawson K, Boursot P. 2001. Interpretation of variation across marker loci as evidence of selection. Genetics 158:1811–23
115. Wakeley J, Nielsen R, Liu-Cordero SN, Ardlie K. 2001. The discovery of single-nucleotide polymorphisms—and inferences about human demographic history. Am. J. Hum. Genet. 69:1332–47
116. Wall JD, Andolfatto P, Przeworski M. 2002. Testing models of selection and demography in Drosophila simulans. Genetics 162:203–16
117. Wall JD, Frisse LA, Hudson RR, Di Rienzo A. 2003. Comparative linkage-disequilibrium analysis of the beta-globin hotspot in primates. Am. J. Hum. Genet. 73:1330–40
118. Williamson SH, Hernandez R, Fledel-Alon A, Zhu L, Nielsen R, Bustamante CD. 2005. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc. Natl. Acad. Sci. USA 102:7882–87
119. Wong WSW, Nielsen R. 2004. Detecting selection in noncoding regions of nucleotide sequences. Genetics 167:949–58
120. Wong WSW, Yang ZH, Goldman N, Nielsen R. 2004. Accuracy and power of statistical methods for detecting adaptive evolution in protein coding sequences and for identifying positively selected sites. Genetics 168:1041–51
121. Woolf LI, McBean MS, Woolf FM, Cahalane SF. 1975. Phenylketonuria as a balanced polymorphism—nature of heterozygote advantage. Ann. Hum. Genet. 38:461–69
122. Wyckoff GJ, Wang W, Wu CI. 2000. Rapid evolution of male reproductive genes in the descent of man. Nature 403:304
123. Yang ZH. 1998. Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol. Biol. Evol. 15:568–73
124. Yang ZH. 2002. Inference of selection from multiple species alignments. Curr. Opin. Genet. Dev. 12:688–94
125. Yang ZH, Bielawski JP. 2000. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15:496–503
126. Yang ZH, Nielsen R. 1998. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J. Mol. Evol. 46:409–18
127. Yang ZH, Nielsen R. 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17:32–43
Tajima’s classic paper introducing his well-known neutrality test.
A nice study showing how both selection and demography must be taken into account when interpreting genetic data.
A review of the statistical methodology used to detect positive selection using dN/dS ratios.
128. Yang ZH, Nielsen R. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19:908–17
129. Yang ZH, Nielsen R, Goldman N, Pedersen AMK. 2000. Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–49
RELATED RESOURCES
Fay JC, Wu C-I. 2003. Sequence divergence, functional constraint, and selection in protein evolution. Annu. Rev. Genomics Hum. Genet. 4:213–35
Lewontin RC. 2002. Directions in evolutionary biology. Annu. Rev. Genet. 36:1–18
Kreitman M. 2000. Methods to detect selection in populations with applications to the human. Annu. Rev. Genomics Hum. Genet. 1:539–59