Learn R Programming

polysat (version 1.7-7)

calcPopDiff: Estimate Population Differentiation Statistics

Description

Given a data frame of allele frequencies and population sizes, calcPopDiff calculates a matrix of pairwise \(F_{ST}\), \(G_{ST}\), Jost's \(D\), or \(R_{ST}\) values, or a single global value for any of these four statistics. calcFst is a wrapper for calcPopDiff to allow backwards compatibility with previous versions of polysat.

Usage

calcPopDiff(freqs, metric, pops = row.names(freqs), 
            loci = unique(gsub("\\..*$", "", names(freqs))), global = FALSE,
            bootstrap = FALSE, n.bootstraps = 1000, object = NULL)

calcFst(freqs, pops = row.names(freqs), loci = unique(gsub("\\..*$", "", names(freqs))), ...)

Value

If global = FALSE and bootstrap = FALSE, a square matrix containing \(F_{ST}\), \(G_{ST}\), \(D\), or \(R_{ST}\) values. The rows and columns of the matrix are both named by population.

If global = TRUE and bootstrap = FALSE, a single value indicating the \(F_{ST}\), \(G_{ST}\), \(D\), or \(R_{ST}\) value.

If global = TRUE and bootstrap = TRUE, a vector of bootstrapped replicates of \(F_{ST}\), \(G_{ST}\), \(D\), or \(R_{ST}\).

If global = FALSE and bootstrap = TRUE, a square two-dimensional list, with rows and columns named by population. Each item is a vector of bootstrapped values for \(F_{ST}\), \(G_{ST}\), \(D\), or \(R_{ST}\) for that pair of populations.

Arguments

freqs

A data frame of allele frequencies and population sizes such as that produced by simpleFreq or deSilvaFreq. Each population is in one row, and a column called Genomes (or multiple columns containing the locus names and “Genomes” seperated by a period) contains the relative size of each population. All other columns contain allele frequencies. The names of these columns are the locus name and allele name, separated by a period.

metric

The population differentiation statistic to estimate. Can be “Fst”, “Gst”, “Jost's D”, or “Rst”.

pops

A character vector. Populations to analyze, which should be a subset of row.names(freqs).

loci

A character vector indicating which loci to analyze. These should be a subset of the locus names as used in the column names of freqs.

global

Boolean. If TRUE, a single global statistic will be estimated across all populations. If FALSE, pairwise statistics will be estimated between populations.

bootstrap

Boolean. If TRUE, a set of replicates bootstrapped across loci will be returned. If FALSE, a single value will be returned for each pair of populations (if global = FALSE) or for the whole set (if global = TRUE).

n.bootstraps

Integer. The number of bootstrap replicates to perform. Ignored if bootstrap = FALSE.

object

A "genambig" or "genbinary" object with the Usatnts slot filled in. Required for metric = "Rst" only.

...

Additional arguments to be passed to calcPopDiff (i.e. global, bootstrap, and/or n.bootstraps).

Author

Lindsay V. Clark

Details

For metric = "Fst" or calcFst:

\(H_S\) and \(H_T\) are estimate directly from allele frequencies for each locus for each pair of populations, then averaged across loci. Wright's \(F_{ST}\) is then calculated for each pair of populations as \(\frac{H_T - H_S}{H_T}\).

\(H\) values (expected heterozygosities for populations and combined populations) are calculated as one minus the sum of all squared allele frequencies at a locus. To calculte \(H_T\), allele frequencies between two populations are averaged before the calculation. To calculate \(H_S\), \(H\) values are averaged after the calculation. In both cases, the averages are weighted by the relative sizes of the two populations (as indicated by freqs$Genomes).

For metric = "Gst":

This metric is similar to \(F_{ST}\), but mean allele frequencies and mean expected heterozygosities are not weighted by population size. Additionally, unbiased estimators of \(H_S\) and \(H_T\) are used according to Nei and Chesser (1983; equations 15 and 16, reproduced also in Jost (2008)). Instead of using twice the harmonic mean of the number of individuals in the two subpopulations as described by Nei and Chesser (1983), the harmonic mean of the number of allele copies in the two subpopulations (taken from freq$Genomes) is used, in order to allow for polyploidy. \(G_{ST}\) is estimated for each locus and then averaged across loci.

For metric = "Jost's D":

The unbiased estimators of \(H_S\) and \(H_T\) and calculated as with \(G_{ST}\), without weighting by population size. They are then used to estimate \(D\) at each locus according to Jost (2008; equation 12):

$$2 * \frac{H_T - H_S}{1 - H_S}$$

Values of \(D\) are then averaged across loci.

For metric = "Rst":

\(R_{ST}\) is estimated on a per-locus basis according to Slatkin (1995), but with populations weighted equally regardless of size. Values are then averaged across loci.

For each locus:

$$S_w = \frac{1}{d} * \sum_j^d{\frac{\sum_i{\sum_{i' < i}{p_{ij} * p_{i'j} * n_j^2 * (a_i - a_{i'})^2}}}{n_j * (n_j - 1)}}$$

$$\bar{S} = \frac{\sum_i{\sum_{i' < i}{\bar{p}_i * \bar{p}_{i'} * n^2 * (a_i - a_{i'})^2}}}{n * (n-1)}$$

$$R_{ST} = \frac{\bar{S} - S_w}{\bar{S}}$$

where \(d\) is the number of populations, \(j\) is an individual population, \(i\) and \(i'\) are individual alleles, \(p_{ij}\) is the frequency of an allele in a population, \(n_j\) is the number of allele copies in a population, \(a_i\) is the size of an allele expressed in number of repeat units, \(\bar{p}_i\) is an allele frequency averaged across populations (with populations weighted equally), and \(n\) is the total number of allele copies across all populations.

References

Nei, M. (1973) Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences of the United States of America 70, 3321--3323.

Nei, M. and Chesser, R. (1983) Estimation of fixation indices and gene diversities. Annals of Human Genetics 47, 253--259.

Jost, L. (2008) \(G_{ST}\) and its relatives to not measure differentiation. Molecular Ecology 17, 4015--4026.

Slatkin, M. (1995) A measure of population subdivision based on microsatellite allele frequencies. Genetics 139, 457--462.

See Also

simpleFreq, deSilvaFreq

Examples

Run this code
# create a data set (typically done by reading files)
mygenotypes <- new("genambig", samples = paste("ind", 1:6, sep=""),
                   loci = c("loc1", "loc2"))
Genotypes(mygenotypes, loci = "loc1") <- list(c(206), c(208,210),
                                              c(204,206,210),
    c(196,198,202,208), c(196,200), c(198,200,202,204))
Genotypes(mygenotypes, loci = "loc2") <- list(c(130,134), c(138,140),
                                              c(130,136,140),
    c(138), c(136,140), c(130,132,136))
PopInfo(mygenotypes) <- c(1,1,1,2,2,2)
mygenotypes <- reformatPloidies(mygenotypes, output="sample")
Ploidies(mygenotypes) <- c(2,2,4,4,2,4)
Usatnts(mygenotypes) <- c(2,2)

# calculate allele frequencies
myfreq <- simpleFreq(mygenotypes)

# calculate pairwise differentiation statistics
myfst <- calcPopDiff(myfreq, metric = "Fst")
mygst <- calcPopDiff(myfreq, metric = "Gst")
myD <- calcPopDiff(myfreq, metric = "Jost's D")
myrst <- calcPopDiff(myfreq, metric = "Rst", object = mygenotypes)

# examine the results
myfst
mygst
myD
myrst

# get global statistics
calcPopDiff(myfreq, metric = "Fst", global = TRUE)
calcPopDiff(myfreq, metric = "Gst", global = TRUE)
calcPopDiff(myfreq, metric = "Jost's D", global = TRUE)
calcPopDiff(myfreq, metric = "Rst", global = TRUE, object = mygenotypes)

Run the code above in your browser using DataLab