testdistr: Graphically compare the distribution of a variable against a specific distribution

Description

This is a simple plotting function designed to help examine distributions. It also includes an option for assessing multivariate normality using the (squared) Mahalanobis distance.

Usage

testdistr(x, distr = c("normal", "beta", "chisq", "f", "gamma", "nbinom",
  "poisson", "mvnormal"), na.rm = TRUE, starts, xlim = NULL,
  varlab = "X", plot = TRUE, extremevalues = c("no", "theoretical",
  "empirical"), ev.perc = 0.005, use = c("complete.obs",
  "pairwise.complete.obs", "fiml"), robust = FALSE, rugthreshold = 500,
  seed = 1234, factor = 1, ...)

Arguments

The data as a single variable or vector to check the distribution unless the distribution is “mvnormal” in which case it should be a data frame or data table.

distr

A character string indicating the distribution to be tested. Currently one of: “normal”, “beta”, “chisq” (chi-squared), “f”, “gamma”, “nbinom” (negative binomial), “poisson”, or “mvnormal” for multivariate normal where Mahalanobis distances are calculated and compared against a Chi-squared distribution with degrees of freedom equal to the number of variables.

na.rm

A logical value whether to omit missing values. Defaults to TRUE.

starts

A named list of the starting values. Not required for all distributions. Passed on to fitdistr which fits the maximum likelihood estimates of the distribution parameters.

xlim

An optional vector to control the x limits for the theoretical distribution density line, useful when densities become extreme at boundary values to help keep the scale of the graph reasonable. Passed on to stat_function.

varlab

A character vector the label to use for the variable

plot

A logical vector whether to plot the graphs. Defaults to TRUE.

extremevalues

A character vector whether to indicate extreme values. Should be “no” to do nothing, “empirical” to show extreme values based on the observed data percentiles, or “theoretical” to show extreme values based on percentiles of the theoretical distribution.

ev.perc

Percentile to use for extreme values. For example if .01, then the lowest 1 percent and highest 1 percent will be labelled extreme values. Defaults to the lowest and highest 0.5 percent.

use

A character vector indicating how the moments (means and covariance matrix) should be estimated in the presence of missing data when distr = mvnormal. The default is to use complete observations, but full information maximum likelihood based on functions in lavaan is also available. See details.

robust

A logical whether to use robust estimation or not. Currently only applies to normally distributed data (univariate or multivariate). Also, when robust = TRUE, only complete observations are used (i.e., use = "complete.obs"). See details.

rugthreshold

Integer determining the number of observations beyond which no rug plot is added. Note that even if this threshold is exceeded, a rug plot will still be added for any extreme values (if extreme values are used and present).

seed

a random seed used to make the jitter added for Poisson and Negative Binomial distributions reproducible

factor

A scale factor fo the amount of jitter added to the QQ and Deviates plots for Poisson and Negative Binomial distributions. Defaults to 1. This results in 1 * smallest distance between points / 5 being used.

...

Additional arguments. If these include mu and sigma and the distribution is multivariate normal, then it will use the passed values instead of calculating the mean and covariances of the data.

Value

An invisible list with the ggplot2 objects for graphs, as well as information about the distribution (parameter estimates, name, log likelihood (useful for comparing the fit of different distributions to the data), and a dataset with the sorted data and theoretical quantiles.#'

Details

Note that for the use argument, several options are possible. By default it is “complete.obs”, which uses only cases with complete data on all variables. Another option is “pairwise.complete.obs”, which uses all available data for each variable indivdiually to estimate the means and variances, and all pairwise complete observation pairs for each covariance. Because the same cases are not used for all estimates, it is possible to obtain a covariance matrix that is not positive definite (e.g., correlations > +1 or < -1).

Finally, the last option is “fiml”, which uses full information maximum likelihood estimates of the means and covariance matrix. Depending on the number of cases, missing data patterns, and variables, this may be quite slow and computationally demanding.

The robust argument determines whether to use robust estimates or not when calculating densities, etc. By default it is FALSE, but if TRUE and a univariate or multivariate normal distribution is tested, then robust estimates of the means and covariance matrix (a variance if univariate) will be used based on covMcd from the robustbase package.

Examples

Run this code

# NOT RUN {
# }
# NOT RUN {
## example data
set.seed(1234)
d <- data.frame(
  Ynorm = rnorm(200),
  Ybeta = rbeta(200, 1, 4),
  Ychisq = rchisq(200, 8),
  Yf = rf(200, 5, 10),
  Ygamma = rgamma(200, 2, 2),
  Ynbinom = rnbinom(200, mu = 4, size = 9),
  Ypois = rpois(200, 4))

## testing and graphing
testdistr(d$Ybeta, "beta", starts = list(shape1 = 1, shape2 = 4))
testdistr(d$Ychisq, "chisq", starts = list(df = 8))
testdistr(d$Yf, "f", starts = list(df1 = 5, df2 = 10))
testdistr(d$Ygamma, "gamma")
testdistr(d$Ynbinom, "poisson")
testdistr(d$Ynbinom, "nbinom")
testdistr(d$Ypois, "poisson")

## compare log likelihood of two different distributions
testdistr(d$Ygamma, "normal")$Distribution$LL
testdistr(d$Ygamma, "gamma")$Distribution$LL

testdistr(d$Ynorm, "normal")
testdistr(c(d$Ynorm, 10, 1000), "normal",
  extremevalues = "theoretical")
testdistr(c(d$Ynorm, 10, 1000), "normal",
  extremevalues = "theoretical", robust = TRUE)

testdistr(mtcars, "mvnormal")

rm(d) ## cleanup
# }