The Atavism

Thursday, August 9, 2012

Measuring population differentiation in R
This is a little bit different than most posts here. I have a paper out today in Molecular Ecology Resources:  "mmod: an R library for the calculation of population differentiation statistics" (doi: 10.1111/j.1755-0998.2012.03174.x). Looking around the web, there aren't many simple expositions of just what a "differentiation statistic" might be, and why the "modern measures of differentiation" my little R package can calculate might improve on the more traditional ones. So,  I thought I'd have a go here. 

Biologists often want to be able to measure the degree to which a population is divided into smaller sub-populations. This can be an important thing to quantify, because sub-populations within highly structured populations are, to some extent, genetically distinct from other sub-populations and therefore have their own evolutionary histories (and perhaps futures).

To illustrate this point I've run some simulations. Imagine if we had 5 subpopulations, each with a thousand individuals. In each population we will follow the fate of a locus with two alleles, R and r that have no effect on survival or reproduction and start with frequencies 0.8 and 0.2 respectively (these numbers motivated by this post). In the absence of gene flow between these populations (Panel 1) the frequency of the r allele bounces around due to genetetic drift (evolutionary change, after all, is inevitable). Crucially though, changes in one population can't effect other populations so we end up with substantial among-population differences in allele frequency. In the next two panels, in each generation a proportion of each population's individuals (0.001 and 0.01 respectively) are drawn from the other populations in the simulation. Now that the populations are sharing genes the lines that represent their allele frequencies pull together  (that is, the among-population variation is reduced). 


One way to quantify the among-population variation displayed in these simulations is to look at the number of heterozygotes you expect to observe across the entire population. The final values for P(r) in the first simulation were {0.33, 0.47. 0.88. 0.10. 0.33} with a mean frequency of 0.42 (so the frequency of the R allele would be 0.58). Knowing our Hardy Weinberg, if we had one big population with two alleles, one being at a frequency of 0.42 we'd expect to get 2pq = 2 * 0.42 * 0.58 = 0.40 heterozygotes. We can call that number Hfor expected total heterozygosity. But thats not what we'd actually see in this case. The sub-populations that make up this larger population have their own allele frequencies, when we calculate the expected proportion of heterozygotes for each of these populations by themselves we end up with {0.44, 0.49, 0.21, 0.18, 0.44} for a within-population expected heterozygosity (HS) of 0.35*. This lack of heterozygotes within sub-populations compared with the total population expectation will always arise when genetic drift makes sub-populations distinct from each other.  Masatoshi Nei  used this pattern to propose a statistic to quantify population divergence called GST, which he defined like this:


Nei's motivaton with GST was to generalise Sewall Wright's FST **, which was defined for diploid organisms and two-allele systems, so that it could be used for any genetic data. But there's a problem with this formulation. Because HT  is always larger than H and can't be greater than one, the maximum possible value of  GST  is 1-HS. This dependency on the within-population genetic diversity means comparisons between studies, and even between loci in one study, are difficult (since Hwill likely be different in each case). This is particularly worryingly for highly polymorphic makers like microsatellites, which can give values of HS as high as 0.9, severely constraining the possible values of GST.

Although the problem of  GST's dependence on HS has been known for a while, it's taken some time for new statistics that get around this problem to be developed. Philip Hedrick (doi: 10.1554/05-076.1) along with Patrick Meirmans (doi: 10.1111/j.1755-0998.2010.02927.x) introduced G''ST  - a version of GST that is corrected for the observed value of HS as well as the number of sub-populations being considered. Meirmans used a similar trick to define φ'ST  (doi: 10.1111/j.0014-3820.2006.tb01874.x), another FST analogue that partitions genetic distances into within- and between-population components. Most recently, Lou Joust introduced an entirely separate statistic, D, that  directly measures allelic divergence (doi 10.1111/j.1365-294X.2008.03887.x). 

The statistical programming language R is becoming increasingly popular among biologists. Although there is a strong suite of tools for performing population genetic analyses in R, code to calculate these "new" measures of population divergence have not been available. My package, mmod, fills this gap.  I won't give too many details of the package here, as that's detailed in the paper and the package is will documented. Briefly, mmod has functions to calculate the three statistics described above (and Nei's  GST ), as well as pairwise versions of each statistic for every population in a datastet. It also allows users to perform bootstrap and jacknife re-sampling of datasets, the results of which are returned as user-accessable objects which can be examined with any R function (there is also a helper function to easily apply differentiation statistics to bootstrap sample and summarise the results) . The library is on CRAN, so installation is as easy as typing "install.pacakge("mmod")", the source code is up on github. If want to use the package I'd suggest reading the vignette ("mmod-demo") before you dive in.

I'm keen to hear about bugs or feature requests from users, just email them to


Winter, D.J. (in press). MMOD: an R library for the calculation of population differentiation statisticsMolecular Ecology Resources :

* mmod actually uses nearly unbiased estimators for these parameters, to deal with the way small population samples can mis-represent the actual allele frequencies in populations.

** I don't want to write an entire history of F-statisitcs here, because it's a big and murky topic, but I did want to make the point that the formulation I gave for GST  is often presented as "Wright's FST " in genetics courses. Wright was certainly aware that his statistic was related to the proportion of heterozygotes you expect to get in a populaiton, but, when he introduced F-statistics in general, and FST  in particular, he was really dealing with correlation among gametes at various levels of population structure. Unfortunately, there are now many many definitions of FST  floating around, and it's probably pointless to argue about a "right one". If you use my package I encourage you to be explicit about, and cite, the particular statistic that you are using. For each of the the FST  analogues that the package calculates the in-line help contains the correct reference. 

Labels: , , ,

Posted by David Winter 11:00 AM


Post a Comment