poppr: Produce a basic summary table for population genetic analyses.

Description

For the poppr package description, please see package?poppr

This function allows the user to quickly view indices of heterozygosity, evenness, and linkage to help decide how to further analyze a specified dataset. It natively takes genind and genclone objects, but can convert any raw data format that adegenet can read (fstat, structure, genetix, and genpop) as well as genalex files exported to csv format (see read.genalex for details).

Usage

poppr(dat, total = TRUE, sublist = "ALL", blacklist = NULL, sample = 0,
  method = 1, missing = "ignore", cutoff = 0.05, quiet = FALSE,
  clonecorrect = FALSE, strata = 1, keep = 1, plot = TRUE,
  hist = TRUE, index = "rbarD", minsamp = 10, legend = FALSE, ...)

Arguments

dat

a genind object OR a genclone object OR any fstat, structure, genetix, genpop, or genalex formatted file.

total

When TRUE (default), indices will be calculated for the pooled populations.

sublist

a list of character strings or integers to indicate specific population names (accessed via popNames). Defaults to "ALL".

blacklist

a list of character strings or integers to indicate specific populations to be removed from analysis. Defaults to NULL.

sample

an integer indicating the number of permutations used to obtain p-values. Sampling shuffles genotypes at each locus to simulate a panmictic population using the observed genotypes. The p-value calculation includes the observed statistic, so set sample to one less than a round number to get a round p-value (e.g. sample = 999 gives a minimum of p = 0.001, whereas sample = 1000 gives p = 0.000999001).

method

an integer from 1 to 4 indicating the sampling method desired. See shufflepop for details.

missing

how should missing data be treated? "zero" and "mean" will set the missing values to those documented in tab. "loci" and "geno" will remove any loci or genotypes with missing data, respectively (see missingno for more information).

cutoff

numeric. A number from 0 to 1 indicating the proportion of missing data allowed for analysis. This is to be used in conjunction with the missing argument (see missingno for details).

quiet

FALSE (default) will display a progress bar for each population analyzed.

clonecorrect

logical. Defaults to FALSE. When TRUE, it must be used together with the strata parameter, or the user may get undesired results. See clonecorrect for details.

strata

a formula indicating the hierarchical levels to be used. The hierarchies should be present in the strata slot. See strata for details.

keep

an integer. This indicates which strata you wish to keep after clone correcting your data sets. To combine strata, set keep from 1 to the number of stratifications set in strata. See clonecorrect for details.

plot

logical. If TRUE (default) and sample > 0, a histogram will be produced for each population.

hist

logical. Deprecated. Use plot instead.

index

character. Either "Ia" or "rbarD". If plot = TRUE, this determines the index used for the visualization.

minsamp

an integer indicating the minimum number of individuals to resample for rarefaction analysis. See rarefy for details.

legend

logical. When this is set to TRUE, a legend describing the resulting table columns will be printed. Defaults to FALSE.

...

arguments to be passed on to diversity_stats

Value

A data frame with populations in rows and the following columns:

Pop

A vector indicating the population factor

N

An integer vector indicating the number of individuals/isolates in the specified population.

MLG

An integer vector indicating the number of multilocus genotypes found in the specified population (see mlg).

eMLG

The expected number of MLG at the lowest common sample size (set by the parameter minsamp).

SE

The standard error for the rarefaction analysis

H

Shannon-Wiener Diversity index

G

Stoddart and Taylor's Index

lambda

Simpson's index

E.5

Evenness

Hexp

Nei's gene diversity (expected heterozygosity)

Ia

A numeric vector giving the value of the Index of Association for each population factor (see ia).

p.Ia

A numeric vector indicating the p-value for Ia from the number of reshufflings indicated in sample. Lowest value is 1/n where n is the number of observed values.

rbarD

A numeric vector giving the value of the Standardized Index of Association for each population factor (see ia).

p.rD

A numeric vector indicating the p-value for rbarD from the number of reshuffles indicated in sample. Lowest value is 1/n where n is the number of observed values.

File

A vector indicating the name of the original data file.
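
As a rough illustration of what the Ia and rbarD columns measure, the sketch below follows the variance-based definitions in Agapow and Burt (2001): Ia compares the observed variance of total pairwise distances with the sum of per-locus variances, and rbarD rescales that difference by the locus-pair standard deviations. The function name ia_sketch and the input layout (one row per pair of individuals, one column per locus) are assumptions made for this example; this is not the code used by ia.

```r
# Hypothetical helper, base R only. `d` is a numeric matrix of pairwise
# distances: one row per pair of individuals, one column per locus.
ia_sketch <- function(d) {
  pvar    <- function(x) mean((x - mean(x))^2)  # population variance
  v_locus <- apply(d, 2, pvar)                  # variance at each locus
  V_O     <- pvar(rowSums(d))                   # observed variance over all loci
  V_E     <- sum(v_locus)                       # expected variance without linkage
  s       <- sqrt(v_locus)
  cross   <- outer(s, s)                        # sqrt(var_j * var_k) for all j, k
  denom   <- 2 * sum(cross[upper.tri(cross)])   # sum over locus pairs j < k
  c(Ia = V_O / V_E - 1, rbarD = (V_O - V_E) / denom)
}

# Three perfectly correlated loci: the strongest possible linkage signal.
one_locus <- c(0, 1, 1, 0, 1, 0)
ia_sketch(cbind(one_locus, one_locus, one_locus))
```

With m identical loci, Ia grows with the number of loci (m - 1) while rbarD stays at 1, which is why rbarD is the default index.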

Details

This table is intended to be a first look into the dynamics of multilocus genotype diversity. Many of the statistics (except for the index of association) are simply based on counts of multilocus genotypes and do not take into account the actual allelic states. Descriptions of the statistics can be found in the Algorithms and Equations vignette: vignette("algo", package = "poppr").
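
To make the "counts of multilocus genotypes" point concrete, here is a minimal base R sketch of how H, G, lambda, and E.5 can be derived from a vector of MLG counts alone. The helper name mlg_diversity is hypothetical, and the formulas are the standard textbook definitions rather than a copy of the package internals; the vignette remains the authoritative description.

```r
# Hypothetical helper: diversity statistics from MLG counts only.
# `counts` holds the number of individuals observed for each MLG.
mlg_diversity <- function(counts) {
  counts <- counts[counts > 0]       # drop empty classes to avoid log(0)
  p      <- counts / sum(counts)     # MLG frequencies
  H      <- -sum(p * log(p))         # Shannon-Wiener index (natural log)
  G      <- 1 / sum(p^2)             # Stoddart and Taylor's index
  lambda <- 1 - sum(p^2)             # Simpson's index
  E.5    <- (G - 1) / (exp(H) - 1)   # evenness
  c(H = H, G = G, lambda = lambda, E.5 = E.5)
}

# Five MLGs with uneven abundances.
mlg_diversity(c(10, 5, 3, 1, 1))
```

Note that nothing about alleles enters the calculation: two data sets with the same MLG counts give the same values regardless of the underlying genotypes.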

sampling

The sampling procedure is explicitly for testing the index of association. None of the other diversity statistics (H, G, lambda, E.5) are tested with this sampling due to the differing data types. To obtain confidence intervals for these statistics, please see diversity_ci.
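
The effect of including the observed statistic in the p-value, as described for the sample argument, can be sketched in a few lines of base R. The helper perm_p_value is a hypothetical name, and the one-sided "greater than or equal" comparison is an assumption about how the test is framed.

```r
# Hypothetical helper: permutation p-value that counts the observed
# statistic as one of the values in the null distribution.
perm_p_value <- function(observed, permuted) {
  (sum(permuted >= observed) + 1) / (length(permuted) + 1)
}

# No permuted value reaches the observed one, so p is at its minimum.
perm_p_value(1, rep(0, 999))    # 1 / 1000
perm_p_value(1, rep(0, 1000))   # 1 / 1001
```

Because the observed value always counts once, the p-value can never be exactly zero, and its smallest possible value depends on the number of permutations.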

rarefaction

Rarefaction analysis is performed on the number of multilocus genotypes because it is relatively easy to estimate (Grünwald et al., 2003). To obtain rarefied estimates of diversity, it is possible to use diversity_ci with the argument rarefy = TRUE.
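
For intuition about the eMLG column, the expected number of MLGs in a subsample of n individuals can be written with the classic rarefaction formula (Hurlbert, 1971; Heck et al., 1975). The sketch below is a base R illustration under that assumption; rarefy_mlg is a hypothetical name and is not the function the package calls.

```r
# Hypothetical helper: expected number of MLGs in a random subsample of
# size n, given the number of individuals observed for each MLG.
rarefy_mlg <- function(counts, n) {
  N <- sum(counts)
  stopifnot(n >= 1, n <= N)
  # For each MLG, 1 minus the probability that it is absent from the subsample.
  sum(1 - choose(N - counts, n) / choose(N, n))
}

counts <- c(10, 5, 3, 1, 1)
rarefy_mlg(counts, n = 10)   # expected MLGs when only 10 individuals are drawn
```

Subsampling every individual returns the observed number of MLGs, and a subsample of one always contains exactly one MLG.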

graphic

This function outputs a ggplot2 graphic of histograms. These can be manipulated to be visualized in another manner by retrieving the plot with the last_plot command from ggplot2. A useful manipulation would be to arrange the graphs into a single column so that the values of the statistic line up:

p <- last_plot()
p + facet_wrap(~population, ncol = 1, scales = "free_y")

The name for the groupings is "population" and the name for the x axis is "value".

References

Paul-Michael Agapow and Austin Burt. Indices of multilocus linkage disequilibrium. Molecular Ecology Notes, 1(1-2):101-102, 2001

A.H.D. Brown, M.W. Feldman, and E. Nevo. Multilocus structure of natural populations of Hordeum spontaneum. Genetics, 96(2):523-536, 1980.

Niklaus J. Grünwald, Stephen B. Goodwin, Michael G. Milgroom, and William E. Fry. Analysis of genotypic diversity data for populations of microorganisms. Phytopathology, 93(6):738-746, 2003.

Bernhard Haubold and Richard R. Hudson. Lian 3.0: detecting linkage disequilibrium in multilocus data. Bioinformatics, 16(9):847-849, 2000.

Kenneth L. Heck Jr., Gerald van Belle, and Daniel Simberloff. Explicit calculation of the rarefaction diversity measurement and the determination of sufficient sample size. Ecology, 56(6):1459-1461, 1975.

Masatoshi Nei. Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 89(3):583-590, 1978.

S H Hurlbert. The nonconcept of species diversity: a critique and alternative parameters. Ecology, 52(4):577-586, 1971.

J.A. Ludwig and J.F. Reynolds. Statistical Ecology. A Primer on Methods and Computing. New York USA: John Wiley and Sons, 1988.

Simpson, E. H. Measurement of diversity. Nature 163: 688, 1949 doi:10.1038/163688a0

Good, I. J. (1953). On the Population Frequency of Species and the Estimation of Population Parameters. Biometrika 40(3/4): 237-264.

Lande, R. (1996). Statistics and partitioning of species diversity, and similarity among multiple communities. Oikos 76: 5-13.

Jari Oksanen, F. Guillaume Blanchet, Roeland Kindt, Pierre Legendre, Peter R. Minchin, R. B. O'Hara, Gavin L. Simpson, Peter Solymos, M. Henry H. Stevens, and Helene Wagner. vegan: Community Ecology Package, 2012. R package version 2.0-5.

E.C. Pielou. Ecological Diversity. Wiley, 1975.

Claude Elwood Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379-423, 623-656, 1948.

J M Smith, N H Smith, M O'Rourke, and B G Spratt. How clonal are bacteria? Proceedings of the National Academy of Sciences, 90(10):4384-4388, 1993.

J.A. Stoddart and J.F. Taylor. Genotypic diversity: estimation and prediction in samples. Genetics, 118(4):705-11, 1988.

Examples

# NOT RUN {
data(nancycats)
poppr(nancycats)

# }
# NOT RUN {
# Sampling
poppr(nancycats, sample = 999, total = FALSE, plot = TRUE)

# Customizing the plot
library("ggplot2")
p <- last_plot()
p + facet_wrap(~population, scales = "free_y", ncol = 1)

# Turning off diversity statistics (see diversity_stats)
poppr(nancycats, total=FALSE, H = FALSE, G = FALSE, lambda = FALSE, E5 = FALSE)

# The previous version of poppr contained a definition of Hexp that was
# calculated as (N/(N - 1))*lambda. This statistic was originally included in
# poppr because it was included in the program multilocus. It was later
# recognized as an unbiased Simpson's diversity metric (Lande, 1996;
# Good, 1953). It can still be calculated by passing a custom function:

data(Aeut)

uSimp <- function(x){
  lambda <- vegan::diversity(x, "simpson")
  x <- drop(as.matrix(x))
  if (length(dim(x)) > 1){
    N <- rowSums(x)
  } else {
    N <- sum(x)
  }
  return((N/(N-1))*lambda)
}
poppr(Aeut, uSimp = uSimp)


# Demonstration with viral data
# Note: this is a larger data set that could take a couple of minutes to run
# on slower computers. 
data(H3N2)
strata(H3N2) <- data.frame(other(H3N2)$x)
setPop(H3N2) <- ~country
poppr(H3N2, total = FALSE, sublist = c("Austria", "China", "USA"),
      clonecorrect = TRUE, strata = ~country/year)
# }
