samp.dist: Animated and/or snapshot representations of a statistic's sampling distribution

Description

This help page describes a series of asbio functions for depicting sampling distributions. The function samp.dist samples from a parent distribution without replacement with sample size = s.size, R times. At each iteration a statistic requested in stat is calculated. Thus a distribution of R statistic estimates is created. The function samp.dist shows this distribution as an animated anim = TRUE or non-animated anim = FALSE density histogram. Sampling distributions for up to four different statistics utilizing two different parent distributions are possible using samp.dist. Sampling distributions can be combined in various ways by specifying a function in func (see below). The function samp.dist.n was designed to show (with animation) how sampling distributions vary with sample size, and is still under development. The function samp.dist.snap creates snapshots, i.e. simultaneous views of a sampling distribution at particular sample sizes. The function dirty.dist can be used to create contaminated parent distributions.

Usage

samp.dist(parent = NULL, parent2 = NULL, biv.parent = NULL, s.size = 1, s.size2
 = NULL, R = 1000, nbreaks = 50, stat = mean, stat2 = NULL, stat3 = NULL, stat4 
 = NULL, xlab = expression(bar(x)), func = NULL, show.n = TRUE, show.SE = FALSE, 
 anim = TRUE, interval = 0.01, col.anim = "rainbow", digits = 3, ...)
samp.dist.snap(parent = NULL, parent2 = NULL, biv.parent = NULL, stat = mean, 
stat2 = NULL, stat3 = NULL, stat4 = NULL, s.size = c(1, 3, 6, 10, 20, 50), 
s.size2 = NULL, R = 1000, func = NULL, xlab = expression(bar(x)), 
show.SE = TRUE, fits = NULL, show.fits = TRUE, xlim = NULL, ylim = NULL, ...)
samp.dist.method.tck()
samp.dist.tck(statc = "mean")
samp.dist.snap.tck1(statc = "mean")
samp.dist.snap.tck2(statc = "mean")
dirty.dist(s.size, parent = expression(rnorm(1)), 
cont = expression(rnorm(1, mean = 10)), prop.cont = 0.1)
samp.dist.n(parent, R = 500, n.seq = seq(1, 30), stat = mean, xlab = expression(bar(x)), 
    nbreaks = 50, func = NULL, show.n = TRUE, 
    show.SE = FALSE, est.density = TRUE, col.density = 4, lwd.density = 2, 
    est.ylim = TRUE, ylim = NULL, anim = TRUE, interval = 0.5, 
    col.anim = NULL, digits = 3, ...)

Value

Returns a representation of a statistic's sampling distribution in the form of a histogram.

Arguments

parent: A vector or vector generating function, describing the parental distribution. Any collection of values can be used. When using random value generators for parental distributions, for CPU efficiency (and accuracy) one should use parent = expression(rpdf(s.size, ...)). Datasets exceeding 100000 observations are not recommended.
parent2: An optional second parental distribution (see parent above), useful for the construction of sampling distributions of test statistics. When using random value generators use parent2 = expression(rpdf(s.size2, ...)).
biv.parent: A bivariate (two column) distribution.
s.size: An integer defining sample size (or a vector of integers in the case of samp.dist.snap) to be taken at each of R iterations from the parental distribution.
s.size2: An optional integer defining a second sample size if a second statistic is to be calculated. Again, this will be a vector of integers in the of samp.dist.snap.
R: The number of samples to be taken from parent distribution(s).
nbreaks: Number of breaks in the histogram.
stat: The statistic whose sampling distribution is to be represented. Will work for any summary statistic that only requires a call to data; e.g. mean, var, median, etc.
stat2: An optional second statistic. Useful for conceptualizing sampling distributions of test statistics. Calculated from sampling parent2.
stat3: An optional third statistic. The sampling distribution is created from the same sample data used for stat.
stat4: An optional fourth statistic. The sampling distribution is created from the same sample data used for stat2.
xlab: X-axis label.
func: An optional function used to manipulate a sampling distribution or to combine the sampling distributions of two or more statistics. The function must contain the following arguments (although they needn't all be used in the function): s.dist, s.dist2, s.size, and s.size2. When sampling from a single parent distribution use s.dist3 in the place of s.dist2. For an estimator involving two parent distributions and four statistics, six arguments will be required: s.dist, s.dist2, s.dist3, s.dist4. s.size, and s.size2 , s.dist3, and as non-fixed arguments (see example below).
show.n: A logical command, TRUE indicates that sample size for parent will be displayed.
show.SE: A logical command, TRUE indicates that bootstrap standard error for the statistic will be displayed.
anim: A logical command indicating whether or not animation should be used.
interval: Animation speed. Decreasing interval increases speed.
col.anim: Color to be used in animation. Three changing color palettes: rainbow, gray, heat.colors, or "fixed" color types can be used.
digits: The number of digits to be displayed in the bootstrap standard error.
fits: Fitted distributions for samp.dist.snap A function with two argument: s.size and s.size2
show.fits: Logical indicating whether or not fits should be shown (fits will not be shown if no fitting function is specified regardless of whether this is TRUE or FALSE
xlim: A two element numeric vector defining the upper and lower limits of the X-axis.
ylim: A two element numeric vector defining the upper and lower limits of the Y-axis.
statc: Presets for certain statistics. Currently one of "custom", "mean", "median", "trimmed mean", "Winsorized mean", "Huber estimator", "H-L estimator", "sd", "var", "IQR", "MAD", "(n-1)S^2/sigma^2", "F*", "t* (1 sample)", "t* (2 sample)", "Pearson correlation" or "covariance".
cont: A distribution representing a source of contamination in the parent population. Used by function dirty.dist.
prop.cont: The proportion of the parent distribution that is contaminated by code.
n.seq: A range of sample sizes for samp.dist.n
est.density: A logical command for samp.dist.n. if TRUE then a density line is plotted over the histogram. Only used if fix.n = true.
col.density: The color of the density line for samp.dist.n. See est.density above.
lwd.density: The width of the density line for samp.dist.n. See est.density above.
est.ylim: Logical. If TRUE Y-axis limits are estimated logically for the animation in samp.dist.n. Consistent Y-axis limits make animations easier to visualize. Only used if fix.n = TRUE.
...: Additional arguments from plot.histogram.

Author

Ken Aho

Details

Sampling distributions of individual statistics can be created with samp.dist, or the function can be used in more sophisticated ways, e.g. to create sampling distributions of ratios of statistics, i.e. t*, F* etc. (see examples below). To provide pedagogical clarity animation for figures is provided. To calculate bivariate statistics, specify the parent distribution with biv.parent and the statistic with func (see below).

Two general uses of the function samp.dist are possible. 1) One can demonstrate the accumulation of statistics for a single sample size using animation. This is useful because as more and more statistics are acquired the frequentist paradigm associated with sampling distributions becomes better represented (i.e the number of estimates is closer to infinity). This is elucidated by allowing the default fix.n = TRUE. Animation will be provided with the default anim = TRUE. Up two parent distributions, up to two sample sizes, and up to four distinct statistics (i.e. four distinct sampling distributions, representing four distinct estimators) can be used. The arguments stat and stat3 will be drawn from parent, while stat3 and stat4 will be drawn from parent2. These distributions can be manipulated and combined in an infinite number of ways with an auxiliary function called in the argument func (see examples below). This allows depiction of sampling distributions made up of multiple estimators, e.g. test statistics. 2) One can provide simultaneous snapshots of a sampling distribution at a particular sample size with the function samp.dist.snap.

Loading the package tcltk allows use of the functions samp.dist.tck, samp.dist.method.tck, samp.dist.snap.tck1 and samp.dist.snap.tck2, which provide interactive GUIs that run samp.dist.

Examples

Run this code

if (FALSE) {
##Central limit theorem
#Snapshots of four sample sizes.
samp.dist.snap(parent=expression(rexp(s.size)), s.size = c(1,5,10,50), R = 1000)

##sample mean animation
samp.dist(parent=expression(rexp(s.size)), col.anim="heat.colors", interval=.3)

##Distribution of t-statistics from a pooled variance t-test under valid and invalid assumptions
#valid
t.star<-function(s.dist1, s.dist2, s.dist3, s.dist4, s.size = 6, s.size2 = 
s.size2){
MSE<-(((s.size - 1) * s.dist3) + ((s.size2 - 1) * s.dist4))/(s.size + s.size2-2)
func.res <- (s.dist1 - s.dist2)/(sqrt(MSE) * sqrt((1/s.size) + (1/s.size2)))
func.res}

samp.dist(parent = expression(rnorm(s.size)), parent2 = 
expression(rnorm(s.size2)), s.size=6, s.size2 = 6, R=1000, stat = mean, 
stat2 = mean, stat3 = var, stat4 = var, xlab = "t*", func = t.star)

curve(dt(x, 10), from = -6, to = 6, add = TRUE, lwd = 2)
legend("topleft", lwd = 2, col = 1, legend = "t(10)")

#invalid; same population means (null true) but different variances and other distributional 
#characteristics.
samp.dist(parent = expression(runif(s.size, min = 0, max = 2)), parent2 = 
expression(rexp(s.size2)), s.size=6, s.size2 = 6, R = 1000, stat = mean, 
stat2 = mean, stat3 = var, stat4 = var, xlab = "t*", func = t.star)

curve(dt(x, 10),from = -6, to = 6,add = TRUE, lwd = 2)
legend("topleft", lwd = 2, col = 1, legend = "t(10)")

## Pearson's R
require(mvtnorm)
BVN <- function(s.size) rmvnorm(s.size, c(0, 0), sigma = matrix(ncol = 2, 
nrow = 2, data = c(1, 0, 0, 1)))
samp.dist(biv.parent = expression(BVN(s.size)), s.size = 20, func = cor, xlab = "r")
                                                  
#Interactive GUI, require package 'tcltk'
samp.dist.tck("S^2")
samp.dist.snap.tck1("Huber estimator")
samp.dist.snap.tck2("F*")
}

Run the code above in your browser using DataLab